One of the most common challenges in AI and ML workloads is the need for expensive GPU compute, which is typically used for deep neural networks. GPU-based compute is not available in every region, and this dependency can delay productionizing such workloads. In addition, most edge devices do not carry GPUs.
This notebook features Computer Vision techniques that achieve good enough ML model performance on niche healthcare datasets. In this case we use Kaggle competition data to detect PNEUMONIA, doing a binary classification of chest X-ray images with high Accuracy and Recall/Sensitivity.
The goal of this effort is to showcase that near-ideal ML training and inferencing can be done on CPU-based ACC confidential compute such as Confidential VMs and Confidential ACI.
The data source is https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia.
This notebook performs chest X-ray classification, identifying each image as PNEUMONIA or NORMAL.
Other datasets that can be showcased as well:
https://www.kaggle.com/datasets/purna135/chest-xray-dataset
https://www.kaggle.com/datasets/nih-chest-xrays/data
The following research paper was referenced and extended in this work:
https://ieeexplore.ieee.org/document/9297608
Content-Based Image Retrieval is implemented using a color histogram, Local Binary Pattern (LBP), and Histogram of Oriented Gradients (HOG). The implementation consists of three steps: preprocessing, feature extraction, and classification. In content-based image retrieval, feature extraction is a very challenging task. The paper uses color histogram features, LBP to extract texture features, and HOG to extract shape features. A Support Vector Machine classifier is used for classification. The experimental results show that the combination of all three features outperforms a single feature or a combination of two feature extraction techniques.
Following up on the above work, I have combined Contours & Edges as well to achieve even better results for the dataset.
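As a rough illustration of how such a combined feature vector can be assembled, here is a minimal sketch. It is not the notebook's actual extraction code: the function name `combo_features`, the bin counts, the HOG cell sizes, and the use of scikit-image's `canny` (instead of OpenCV) are all assumptions made to keep the example short and self-contained, and a random array stands in for an X-ray.

```python
import numpy as np
from skimage.feature import canny, hog, local_binary_pattern


def combo_features(gray, hist_bins=32, lbp_points=8, lbp_radius=1):
    """Concatenate histogram, LBP, HOG and edge features for one grayscale image.

    `gray` is a 2-D uint8 array. All parameter values are illustrative assumptions.
    """
    # Intensity histogram, normalized so that image size does not matter.
    hist, _ = np.histogram(gray, bins=hist_bins, range=(0, 256))
    hist = hist / hist.sum()

    # Uniform LBP produces lbp_points + 2 distinct codes; summarize them as a histogram.
    lbp = local_binary_pattern(gray, lbp_points, lbp_radius, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=lbp_points + 2, range=(0, lbp_points + 2))
    lbp_hist = lbp_hist / lbp_hist.sum()

    # HOG descriptor (gradient orientation histograms over local cells).
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))

    # Edge density: fraction of pixels that the Canny detector marks as edges.
    edges = canny(gray.astype(float) / 255.0, sigma=2.0)
    edge_density = np.array([edges.mean()])

    return np.concatenate([hist, lbp_hist, hog_vec, edge_density])


# Random pixels stand in for a real chest X-ray (assumption for the demo only).
rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
features = combo_features(demo_image)
print(features.shape)
```

Summarizing LBP as a histogram keeps the vector length independent of image size, whereas the HOG part grows with the image, so in practice all images would need to be resized to a common shape first.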
Aside from using the combined feature set described above, we have also compared against static features extracted from the pre-trained VGG16 and ResNet50 models.
Neural model fine-tuning will be added in the future and may be more compute and memory intensive.
Batching of feature extraction will also be added in the future to further minimize the compute requirement.
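One possible shape for such batching is sketched below. This is an assumption about how it could be done, not the planned implementation: `iter_batches`, `extract_in_batches`, `fake_load` and `fake_extract` are hypothetical names, and the loader and extractor are stand-ins for reading an image and running a feature extractor on a stacked batch.

```python
import numpy as np


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_in_batches(paths, load_fn, extract_fn, batch_size=32):
    """Load and featurize images batch by batch so only one batch is in memory at a time."""
    chunks = []
    for batch in iter_batches(paths, batch_size):
        images = np.stack([load_fn(path) for path in batch])
        chunks.append(extract_fn(images))
    return np.concatenate(chunks, axis=0)


# Hypothetical stand-ins: a loader returning an 8x8 array and a flattening "extractor".
def fake_load(path):
    rng = np.random.default_rng(len(path))
    return rng.random((8, 8))


def fake_extract(images):
    return images.reshape(len(images), -1)


paths = [f"img_{i}.jpeg" for i in range(10)]
features = extract_in_batches(paths, fake_load, fake_extract, batch_size=4)
print(features.shape)  # (10, 64)
```

Peak memory is then bounded by the batch size rather than by the dataset size, at the cost of one extractor call per batch.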
# Standard library
import glob
import os
import shutil
import sys
import time
import traceback
import tracemalloc
import warnings
from itertools import cycle
# Numerics, imaging and plotting
import cv2
import imageio
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import requests
import seaborn as sns
from numpy.linalg import svd
from scipy import ndimage
from scipy.ndimage import median_filter
from skimage.color import rgb2gray
from skimage.feature import hog, local_binary_pattern
from tqdm import tqdm
# scikit-learn and XGBoost
from sklearn.decomposition import PCA
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.exceptions import ConvergenceWarning, FitFailedWarning
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, auc, average_precision_score,
                             confusion_matrix, f1_score, make_scorer,
                             precision_score, recall_score, roc_auc_score, roc_curve)
from sklearn.model_selection import (GridSearchCV, KFold, ParameterGrid, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import (LabelBinarizer, LabelEncoder, MinMaxScaler,
                                   StandardScaler, label_binarize)
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
# Keras / TensorFlow
import tensorflow as tf
from keras import backend as K
from keras.applications.resnet50 import ResNet50, preprocess_input as resnet50_preprocess_input
from keras.applications.vgg16 import VGG16, preprocess_input as vgg16_preprocess_input
from keras.models import Model
from keras.preprocessing import image
from keras.utils import get_file
There are 2 steps before we will be able to run the following function.
OPTIONAL
Further to the above 2, in this section we load the data into lists from the location where it was stored, optionally after augmentation. We took the augmentation approach because the data across the 2 classes was slightly imbalanced.
As stated earlier, one of the classes, PNEUMONIA for example, had 60% of the images in the ~5000-image dataset. Please look at the appendix section to find the image augmentation code.
Metrics Used: We use AUC, Accuracy, Precision and Recall as our main metrics. In class-imbalance situations AUPRC (area under the precision-recall curve) is generally recommended, but our hypothesis is that data augmentation lets us handle the imbalance.
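For readers who want to check the AUPRC alongside ROC AUC, the following minimal sketch computes both with scikit-learn on a four-sample toy example. The labels and scores are made up for illustration and are not results from this notebook; `average_precision_score` is used as the usual summary of the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and predicted PNEUMONIA probabilities (illustrative values only).
labels = np.array(["NORMAL", "NORMAL", "PNEUMONIA", "PNEUMONIA"])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Encode PNEUMONIA as the positive class.
y_true = (labels == "PNEUMONIA").astype(int)

roc_auc = roc_auc_score(y_true, scores)
auprc = average_precision_score(y_true, scores)

print(f"ROC AUC: {roc_auc:.3f}")  # ROC AUC: 0.750
print(f"AUPRC:   {auprc:.3f}")    # AUPRC:   0.833
```

The two numbers can diverge considerably when the positive class is rare, which is why AUPRC is often reported in addition to ROC AUC for imbalanced data.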
We assume that the data augmentation techniques used are the most applicable ones to this scenario.
We have very limited Confidential computing resources for all the experiments and scenarios we would like to run. The idea is to reach, for example, >70% Accuracy with the simplest and most lightweight features in as little time as possible. Grid search may be hard to fit into this budget, but we will do it.
This section counts the images in the Training, Test & Validation folders.
## This section just checks the number of images per class post or pre augmentation
# Post review Rachel suggested that we don't need to augment the Test & Validation & we made that change
def howmanyimages(main_folder='./chest_xray'):
    """
    Count the files per class in each subfolder and plot the counts as bar charts
    """
    subfolders = ['Training', 'Test', 'Validation']
    class_names = ['NORMAL', 'PNEUMONIA']
    fig, axs = plt.subplots(1, len(subfolders), figsize=(15, 5))
    for i, subfolder in enumerate(subfolders):
        files_per_subfolder = {}
        for class_name in class_names:
            subfolder_path = os.path.join(main_folder, subfolder, class_name)
            if os.path.isdir(subfolder_path):
                files_per_subfolder[class_name] = len([name for name in os.listdir(subfolder_path)
                                                       if os.path.isfile(os.path.join(subfolder_path, name))])
        print("For ", subfolder)
        # Adjacent string literals are joined, so no stray indentation ends up in the message
        print(f'Number of files per class in the main folder: {files_per_subfolder}. '
              'We will pick a few out of these for training, test & validation')
        # Create a bar chart for the current subfolder
        axs[i].bar(list(files_per_subfolder.keys()), list(files_per_subfolder.values()))
        axs[i].set_title(subfolder)
    # Display the charts
    plt.show()
howmanyimages()
For Training
Number of files per class in the main folder: {'NORMAL': 1795, 'PNEUMONIA': 5107}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 301, 'PNEUMONIA': 515}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 10, 'PNEUMONIA': 9}. We will pick a few out of these for training, test & validation
def get_data(main_folder, subfolders, class_names, num_images_per_class, whetherall=False):
    """
    Get the data and return Training, Test & Validation image set paths
    """
    X_train = []
    y_train = []
    X_test = []
    y_test = []
    X_val = []
    y_val = []
    for subfolder in subfolders:
        for class_name in tqdm(class_names):
            class_folder = os.path.join(main_folder, subfolder, class_name)
            #print(class_folder)
            image_count = 0
            for file_name in os.listdir(class_folder):
                if file_name.endswith('.jpeg'):
                    file_path = os.path.join(class_folder, file_name)
                    if subfolder == 'Training':
                        X_train.append(file_path)
                        y_train.append(class_name)
                    elif subfolder == 'Test':
                        X_test.append(file_path)
                        y_test.append(class_name)
                    elif subfolder == 'Validation':
                        X_val.append(file_path)
                        y_val.append(class_name)
                    image_count += 1
                # Stop early once enough images were collected, unless all were requested
                if not whetherall and image_count >= num_images_per_class:
                    break
    return X_train, y_train, X_test, y_test, X_val, y_val
def visualize_features():
"""
This function visualizes and explains why we chose the features we use for modeling
The visualization shows that the combination of 5 types of features showing texture, color depth,
histogram gradients, contours & edges is a good combination for chest X-ray image classification.
Texture captures local spatial arrangement of pixel intensities
Color depth captures the distribution of intensity values, which can help distinguish between the classes
HOG can detect boundaries and edges
Contours & Edges can detect shapes and outlines again to distinguish between classes
#### IMPORTANT NOTE: The current approach does not do any fine-tuning on the VGG16 or Resnet50/Resnet101 models
### Fine-tuning will be added in the future and may be more compute and memory intensive
"""
main_folder_path = './chest_xray'
subfolders = ['Training', 'Test', 'Validation']
class_names = ['NORMAL','PNEUMONIA']
X_train, _,_,_,_,_ = get_data(main_folder_path, subfolders, class_names,
1)
#X_train_filtered = [file for file in X_train if "original" in file]
X_train_filtered = X_train
#print(X_train)
# Select the first file from the list (the "original" file name filter above is disabled)
file_path = X_train_filtered[0]
print("File image path", file_path)
## Plotting all complex features for one given image.
print()
print("Feature Set A")
print()
# Load the selected image with OpenCV (BGR channel order)
image_c = cv2.imread(file_path)
# Convert the image to grayscale
gray = cv2.cvtColor(image_c, cv2.COLOR_BGR2GRAY)
# Compute the color histogram for each channel
color = ('b', 'g', 'r')
histograms = []
for i, col in enumerate(color):
hist = cv2.calcHist([image_c], [i], None, [256], [0, 256])
histograms.append(hist)
# Compute the LBP
lbp = local_binary_pattern(gray, 8, 1)
# Compute the HOG
fd, hog_image = hog(gray, visualize=True)
# Compute the contours and edges
edges = cv2.Canny(gray, 100, 200)
contours, _ = cv2.findContours(edges, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# Apply a Gaussian filter to the edges
edges_gaussian = cv2.GaussianBlur(edges, (5, 5), 0)
# Draw the contours on the original image using green lines
image_with_edges = cv2.drawContours(image_c.copy(), contours, -1, (0, 255, 0), 3)
# Compute the Laplacian edges
laplacian_edges = cv2.Laplacian(gray,cv2.CV_64F)
# Display the image and features
fig = plt.figure(figsize=(10, 10))
plt.subplots_adjust(hspace=0.4, wspace=0.4)
plt.subplot(331), plt.imshow(cv2.cvtColor(image_with_edges,cv2.COLOR_BGR2RGB)),
plt.title('Original with Edges')
plt.subplot(332), plt.imshow(gray,cmap='gray'), plt.title('Grayscale')
# Display the histogram (Bar)
plt.subplot(333), plt.title('Color Histogram (Bar)')
for i in range(3):
plt.bar(np.arange(256) + i * 0.25,histograms[i].ravel(),width=0.25,color=color[i])
plt.subplot(334), plt.imshow(lbp,cmap='gray'), plt.title('LBP')
plt.subplot(335), plt.imshow(hog_image,cmap='gray'), plt.title('HOG')
plt.subplot(336), plt.imshow(cv2.drawContours(image_c.copy(),contours,-1,(0,255,0),3)),
plt.title('Contours with Green Lines')
plt.subplot(337), plt.imshow(edges_gaussian,cmap='gray'), plt.title('Edges with Gaussian Filter')
plt.subplot(338), plt.imshow(laplacian_edges,cmap='gray'), plt.title('Laplacian Edges')
plt.show()
print("Feature Set B")
print()
img_original = image.load_img(file_path)
img = cv2.imread(file_path)
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)), plt.title('Original image of class NORMAL')
# Convert the image to a numpy array and preprocess it
x_original = image.img_to_array(img_original)
x_original = np.expand_dims(x_original, axis=0)
x_vgg16_original = vgg16_preprocess_input(x_original)
url = 'https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5'
m_file_path = './vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5'
# Download the weights file
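# Keep TLS certificate verification enabled and fail early on a bad HTTP status
response = requests.get(url, timeout=60)
response.raise_for_status()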
with open(m_file_path, 'wb') as f:
f.write(response.content)
# Load the VGG16 model without specifying any layer
model_vgg16 = VGG16(weights=m_file_path, include_top=False)
features_vgg16_original = model_vgg16.predict(x_vgg16_original, verbose=0)
# Load the image and resize it to 224x224
img_resized = image.load_img(file_path, target_size=(224, 224))
# Convert the image to a numpy array and preprocess it
x_resized = image.img_to_array(img_resized)
x_resized = np.expand_dims(x_resized, axis=0)
x_vgg16_resized = vgg16_preprocess_input(x_resized)
x_resnet50_resized = resnet50_preprocess_input(x_resized)
url = 'https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5'
m_rfile_path = './vgg16_weights_tf_dim_ordering_tf_kernels.h5'
# Download the weights file
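# Keep TLS certificate verification enabled and fail early on a bad HTTP status
response = requests.get(url, timeout=60)
response.raise_for_status()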
with open(m_rfile_path, 'wb') as f:
f.write(response.content)
# Load the pre-trained VGG16 model with the downloaded weights
base_model = VGG16(weights=m_rfile_path)
# Create a new model that outputs the features from the specified layer
model_vgg16_l = Model(inputs=base_model.input, outputs=base_model.get_layer('block5_conv3').output)
# Load the VGG16 model with specifying a layer block5_conv3
features_vgg16_resized = model_vgg16_l.predict(x_vgg16_resized, verbose=0)
# Load the ResNet50 model with layer conv5_block3_out
urlr = 'https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
file_path_r = './resnet50_weights_tf_dim_ordering_tf_kernels.h5'
# Download the weights file
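# Keep TLS certificate verification enabled and fail early on a bad HTTP status
responser = requests.get(urlr, timeout=60)
responser.raise_for_status()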
with open(file_path_r, 'wb') as fr:
fr.write(responser.content)
model_resnet50_conv5_block3_out = ResNet50(weights=file_path_r) #'imagenet'
model_resnet50_conv5_block3_out = Model(inputs=model_resnet50_conv5_block3_out.input,
outputs=model_resnet50_conv5_block3_out.get_layer('conv5_block3_out').output)
features_resnet50_conv5_block3_out = model_resnet50_conv5_block3_out.predict(x_resnet50_resized, verbose=0)
# Plot the extracted features
plt.figure(figsize=(10, 10))
plt.subplot(321), plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)), plt.title('Original')
plt.subplot(322), plt.imshow(features_vgg16_original[0, :, :, 0], cmap='gray'), plt.title('VGG16 Original Size')
plt.subplot(323), plt.imshow(features_vgg16_resized[0, :, :, 0], cmap='gray'), plt.title('VGG16 Resized')
plt.subplot(324), plt.imshow(features_resnet50_conv5_block3_out[0, :, :, 0], cmap='gray'),
plt.title('ResNet50 Resized')
plt.show()
# Extract the CNN features at 600x450 (height x width) instead of resizing to 224x224
model = VGG16(weights=m_file_path, include_top=False)
img = image.load_img(file_path, target_size=(600, 450))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = vgg16_preprocess_input(img_data)
cnn_feature = model.predict(img_data, verbose=0).flatten()
# Extract the CNN features with resizing to 224x224 using VGG16
base_model = VGG16(weights=m_rfile_path)
model = Model(inputs=base_model.input, outputs=base_model.get_layer('block5_conv3').output)
img = image.load_img(file_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = vgg16_preprocess_input(img_data)
cnn_feature224_vgg = model.predict(img_data, verbose=0).flatten()
# Extract the CNN features with resizing to 224x224 using ResNet50
base_model = ResNet50(weights=file_path_r)
layer_name = 'conv5_block3_out'
model = Model(inputs=base_model.input, outputs=base_model.get_layer(layer_name).output)
img = image.load_img(file_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = resnet50_preprocess_input(img_data)
cnn_feature_resnet = model.predict(img_data, verbose=0).flatten()
print()
# Plot the CNN features without resizing the image
plt.figure(figsize=(10, 10))
for i in range(64):
plt.subplot(8, 8, i + 1)
plt.imshow(cnn_feature.reshape((18, 14, 512))[:, :, i])
plt.axis('off')
plt.suptitle('CNN Features (No Resize) 600x450')
plt.show()
print()
# Plot the CNN features with resizing to 224x224 using VGG16
plt.figure(figsize=(10, 10))
for i in range(64):
plt.subplot(8, 8, i + 1)
plt.imshow(cnn_feature224_vgg.reshape((14, 14, 512))[:, :, i])
plt.axis('off')
plt.suptitle('CNN Features (VGG16) 224x224')
plt.show()
print()
# Plot the CNN features with resizing to 224x224 using ResNet50
plt.figure(figsize=(10, 10))
for i in range(64):
plt.subplot(8, 8, i + 1)
plt.imshow(cnn_feature_resnet.reshape((7, 7, 2048))[:, :, i])
plt.axis('off')
plt.suptitle('CNN Features (ResNet50) 224x224')
plt.show()
fig = plt.figure(figsize=(15, 15))
plt.subplots_adjust(hspace=0.4, wspace=0.4)
for i, class_name in enumerate(class_names):
X_train, _,_,_,_,_ = get_data(main_folder_path, subfolders, [class_name], 100)
#print(X_train)
X_train_filtered = X_train
file_path = X_train_filtered[0]
#print(file_path)
image_n = cv2.imread(file_path)
gray = cv2.cvtColor(image_n, cv2.COLOR_BGR2GRAY)
# Compute the color histogram for each channel
color = ('b', 'g', 'r')
histograms = []
for j, col in enumerate(color):
hist = cv2.calcHist([image_n], [j], None, [256], [0, 256])
histograms.append(hist)
# Compute the LBP
lbp = local_binary_pattern(gray, 8, 1)
# Compute the HOG
fd, hog_image = hog(gray, visualize=True)
# Compute the contours and edges
edges = cv2.Canny(gray, 100, 200)
contours, _ = cv2.findContours(edges, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
# Apply a Gaussian filter to the edges
edges_gaussian = cv2.GaussianBlur(edges, (5, 5), 0)
# Plot the image and features
plt.subplot(7, 6, i * 6 + 1), plt.imshow(cv2.cvtColor(image_n,cv2.COLOR_BGR2RGB)), plt.title(f'Original {class_name}')
plt.subplot(7, 6, i * 6 + 2), plt.imshow(gray,cmap='gray'), plt.title(f'Grayscale {class_name}')
# Display the histogram (Bar)
plt.subplot(7, 6, i * 6 + 3), plt.title(f'Color Histogram {class_name}')
for j in range(3):
plt.bar(np.arange(256) + j * 0.25,histograms[j].ravel(),width=0.25,color=color[j])
plt.subplot(7, 6, i * 6 + 4), plt.imshow(lbp,cmap='gray'), plt.title(f'LBP {class_name}')
plt.subplot(7, 6, i * 6 + 5), plt.imshow(hog_image,cmap='gray'), plt.title(f'HOG {class_name}')
plt.subplot(7, 6, i * 6 + 6), plt.imshow(edges_gaussian,cmap='gray'), plt.title(f'Edges with Gaussian Filter {class_name}')
plt.show()
#visualize_features()
In this function, we train, evaluate & plot the performance of various classifiers (called optimizers in this notebook) on our datasets.
The classifiers used are:
Logistic Regression: This is a simple and fast linear classifier that works well for binary classification tasks. It can also be extended to multi-class classification using techniques such as one-vs-rest or softmax regression.
SGD: Stochastic Gradient Descent (SGD) is an optimization algorithm that can be used to train a wide variety of models, including linear classifiers such as logistic regression and support vector machines. It is an iterative method that updates the model’s parameters in small steps based on a random subset of the training data, making it well-suited for large-scale learning tasks.
Support Vector Machines/SVC: Support Vector Machines (SVMs) are powerful classifiers that can handle both linearly separable and non-linearly separable data by using kernel functions to map the data into a higher-dimensional space. They are effective in high-dimensional spaces and can be trained on smaller datasets.
Random Forests: Random Forests are an ensemble learning method that combines multiple decision trees to improve the accuracy and robustness of the model. They can handle both categorical and continuous features and are relatively easy to interpret.
Gaussian Naive Bayes: This is a simple probabilistic classifier based on Bayes’ theorem that assumes independence between the features. It is fast and easy to implement, and can work well in practice even if the independence assumption is not strictly met.
K Nearest Neighbors: This is a non-parametric instance-based learning algorithm that classifies new instances based on their similarity to the training instances. It is simple to implement and can handle multi-class classification tasks.
[Source of optimizer definitions: bing.com]
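To make the comparison concrete, here is a minimal sketch that cross-validates the same six kinds of classifiers on a synthetic feature matrix. It is only an illustration under stated assumptions: the data comes from `make_classification` rather than from X-ray features, all hyperparameters are defaults or arbitrary small values, the SGD classifier uses its default hinge loss, and the scaling pipelines are an addition that the notebook's own code may not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an extracted feature matrix (assumption, not the X-ray data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
candidates = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SGD": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "NB": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

mean_accuracy = {}
for name, model in candidates.items():
    # 5-fold cross-validated accuracy for each candidate.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no statistics from the held-out fold leak into training.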
We have implemented grid search for all of them, but the process is immensely compute intensive and not meant for a laptop. So we skip it once the best parameters have been found.
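When an exhaustive grid is too expensive, one common way to cap the cost is a randomized search that samples a fixed number of parameter settings. The sketch below shows this with scikit-learn's `RandomizedSearchCV`; it is a swapped-in alternative for illustration, not what this notebook runs. The synthetic data, the injected NaNs, the pipeline step names and the candidate values for `C` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for extracted features, with a few NaNs injected (assumption).
X, y = make_classification(n_samples=200, n_features=15, random_state=0)
X[::17, 3] = np.nan

# Imputation and scaling live inside the pipeline so they are refit on every CV fold.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={"clf__C": [0.01, 0.1, 1, 10, 100]},
    n_iter=3,          # only 3 of the 5 candidate settings are tried
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The total number of fits is `n_iter * cv` plus one final refit, so the budget is set up front instead of growing with the size of the grid.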
Why were these choices made for this problem? For small-scale, non-GPU work these classifiers are a great set of choices. Which one to apply to a particular problem, however, is a decision we made as a team.
Here we have functions for plotting the results and for hyperparameter tuning.
## HELPER FUNCTIONS
def plot_roc_curve(y_test, y_score, classes):
    """
    Plots the ROC curve and its AUC for binary labels.
    y_score holds the scores of the class that LabelEncoder encodes as 1 (PNEUMONIA here)
    """
    le = LabelEncoder()
    y_test_encoded = le.fit_transform(y_test)
    try:
        fpr, tpr, _ = roc_curve(y_test_encoded, y_score)
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = {0:0.2f})'.format(roc_auc))
        plt.plot([0, 1], [0, 1], 'k--', lw=2)
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver operating characteristic for binary data')
        plt.legend(loc="lower right")
        plt.show()
    except Exception as e:
        print("From the plot_roc_curve function", str(e))
        tb = traceback.extract_tb(e.__traceback__)
        filename, line, func, text = tb[-1]
        print(f'An exception occurred in file {filename}, line {line}, in {func}')
        print(f'Code: {text}')
        print(f'Exception: {str(e)}')
def plot_confusion_matrix(y_test, y_pred, class_names):
    """
    Visualizes the confusion matrix for multiple classes
    """
    try:
        cm = confusion_matrix(y_test, y_pred)
        # Normalize each row so that the cells show per-class rates
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        fig, ax = plt.subplots()
        im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
        ax.figure.colorbar(im, ax=ax)
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               xticklabels=class_names, yticklabels=class_names,
               title='Normalized confusion matrix',
               ylabel='True label',
               xlabel='Predicted label')
        plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
                 rotation_mode="anchor")
        fmt = '.2f'
        thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax.text(j, i, format(cm[i, j], fmt),
                        ha="center", va="center",
                        color="white" if cm[i, j] > thresh else "black")
        fig.tight_layout()
        plt.show()
    except Exception as e:
        print("From the plot_confusion_matrix function", str(e))
        tb = traceback.extract_tb(e.__traceback__)
        filename, line, func, text = tb[-1]
        print(f'An exception occurred in file {filename}, line {line}, in {func}')
        print(f'Code: {text}')
        print(f'Exception: {str(e)}')
def evaluate_class(y_pred, y_test, class_name, class_names):
    """
    This function evaluates one of the classes v/s others in terms of metrics
    """
    try:
        # Convert to arrays so that `== c` compares element-wise even when lists are passed
        y_pred = np.asarray(y_pred)
        y_test = np.asarray(y_test)
        # Calculate accuracy, precision, and recall for each class
        accuracies = []
        precisions = []
        recalls = []
        for c in class_names:
            accuracies.append(accuracy_score(y_test == c, y_pred == c))
            precisions.append(precision_score(y_test == c, y_pred == c))
            recalls.append(recall_score(y_test == c, y_pred == c))
        # Reorder the lists to plot the first class first
        first_class_idx = class_names.index(class_name)
        class_names = [class_name] + class_names[:first_class_idx] + class_names[first_class_idx+1:]
        accuracies = [accuracies[first_class_idx]] + accuracies[:first_class_idx] + accuracies[first_class_idx+1:]
        precisions = [precisions[first_class_idx]] + precisions[:first_class_idx] + precisions[first_class_idx+1:]
        recalls = [recalls[first_class_idx]] + recalls[:first_class_idx] + recalls[first_class_idx+1:]
        bar_width = 0.20
        r1 = np.arange(len(class_names))
        r2 = [x + bar_width for x in r1]
        r3 = [x + bar_width for x in r2]
        plt.bar(r1, accuracies, width=bar_width, label="Accuracy")
        plt.bar(r2, precisions, width=bar_width, label="Precision")
        plt.bar(r3, recalls, width=bar_width, label="Recall")
        plt.xticks([r + bar_width for r in range(len(class_names))], class_names)
        plt.hlines(accuracies[0], xmin=r1[0]-bar_width/2, xmax=r3[-1]+bar_width/2,
                   linestyles='dotted', colors='darkgreen', label="First class accuracy")
        plt.hlines(precisions[0], xmin=r1[0]-bar_width/2, xmax=r3[-1]+bar_width/2,
                   linestyles='dotted', colors='darkblue', label="First class precision")
        plt.hlines(recalls[0], xmin=r1[0]-bar_width/2, xmax=r3[-1]+bar_width/2,
                   linestyles='dotted', colors='darkred', label="First class recall")
        plt.legend()
        plt.show()
    except Exception as e:
        print("From the evaluate_class function", str(e))
        tb = traceback.extract_tb(e.__traceback__)
        filename, line, func, text = tb[-1]
        print(f'An exception occurred in file {filename}, line {line}, in {func}')
        print(f'Code: {text}')
        print(f'Exception: {str(e)}')
## MAIN TRAINING & EVALUATION FUNCTION
def train_and_evaluate(X_train, y_train, X_test, y_test, X_val, y_val,
gridsrch=True, optslist='LR,SGD,SVM,KNN,NB,RF'):
"""
This is the main training and evaluation function. It is called with the train, test and
validation arrays and evaluates each of the classifiers listed in optslist, as can be seen below.
The SGD section is selected with the key 'SGD', which is what the code checks for.
"""
global class_names
# Create an imputer object that will replace missing values with the mean
imputer = SimpleImputer(strategy='mean')
# Fit the imputer on the training data
imputer.fit(X_train)
# Transform the training and test data to handle NaNs
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)
X_val_imputed = imputer.transform(X_val)
# print(f"No of Nan in X_train_imputed: {np.sum(np.isnan(X_train_imputed))}")
# print(f"No of Nan in X_test_imputed: {np.sum(np.isnan(X_test_imputed))}")
# main_folder = './chest_xray'
# subfolders = ['Training']
class_names = ['NORMAL','PNEUMONIA']
start = time.time()
optlist = optslist.split(',')
# Ignore convergence warnings
with warnings.catch_warnings():
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=FitFailedWarning)
warnings.filterwarnings("ignore")
print()
if 'LR' in optlist:
print()
print('Logistic Regression training:')
lr = LogisticRegression(max_iter= 1000, C = 100.0)
if gridsrch:
param_grid = {
'C': [100, 300, 50], #20, 30 ,50, 10000,0.1, 0.01, 1,
'penalty': ['l2'], # 'l1' is not supported by the newton-cg and lbfgs solvers, so those fits would fail
'solver': ['newton-cg', 'lbfgs'], #, 'liblinear', 'sag', 'saga''newton-cg',
'max_iter': [5000, 10000], #, 5000, 10000,, 10000 50000, 100000
'multi_class': ['ovr'] #'auto', 'multinomial'
}
grid = GridSearchCV(lr, param_grid, cv = 5, n_jobs=-1, verbose = 15)
grid.fit(X_train_imputed, y_train)
y_pred_lr = grid.predict(X_test_imputed)
y_pred_lr_val = grid.predict(X_val_imputed)
y_pred_lr_train = grid.predict(X_train_imputed)
best_params = grid.best_params_
# Print the best parameters
print("Best parameters: ", best_params)
else:
lr = LogisticRegression(max_iter=10000, C= 100.0) #, verbose = 1
lr.fit(X_train_imputed, y_train)
y_pred_lr = lr.predict(X_test_imputed)
y_pred_lr_val = lr.predict(X_val_imputed)
y_pred_lr_train = lr.predict(X_train_imputed)
print('Accuracy train:', accuracy_score(y_train, y_pred_lr_train))
print('Accuracy test:', accuracy_score(y_test, y_pred_lr))
print('Accuracy val:', accuracy_score(y_val, y_pred_lr_val))
try:
# compute the predicted labels
if gridsrch:
y_pred = np.argmax(grid.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(grid.predict_proba(X_val_imputed), axis=1)
else:
y_pred = np.argmax(lr.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(lr.predict_proba(X_val_imputed), axis=1)
if len(np.unique(y_test)) < 2 or len(np.unique(y_pred)) < 2 or \
len(np.unique(y_val)) < 2 or len(np.unique(y_pred_v)) < 2:
print("ROC AUC score may not be well-defined when not all classes \
are present in y_test or y_pred.")
else:
# compute the ROC AUC score
"""y_pred_proba = lr.predict_proba(X_test_imputed)
auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')"""
if gridsrch:
class_index = list(grid.classes_).index('PNEUMONIA')
print('AUC ROC tests:', roc_auc_score(y_test, grid.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, grid.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(grid.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(grid.predict_proba(X_val_imputed)[:, class_index]), class_names)
else:
class_index = list(lr.classes_).index('PNEUMONIA')
print('AUC ROC tests:', roc_auc_score(y_test, lr.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, lr.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(lr.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(lr.predict_proba(X_val_imputed)[:, class_index]), class_names)
except Exception as e:
print("From the train_and_evaluate function & LR section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
#
print('Precision test:', precision_score(y_test, y_pred_lr, average='weighted'))
print('Recall test:', recall_score(y_test, y_pred_lr, average='weighted'))
print('Precision val:', precision_score(y_val, y_pred_lr_val, average='weighted'))
print('Recall val:', recall_score(y_val, y_pred_lr_val, average='weighted'))
try:
if 'PNEUMONIA' in np.unique(y_pred_lr_val):
print()
print("Printing how this model fares for class PNEUMONIA")
evaluate_class(y_pred_lr_val, y_val, 'PNEUMONIA', class_names)
except Exception as e:
print()
print("From the train_and_evaluate function & LR section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
## Logistic Regression confusion matrix
plot_confusion_matrix(np.array(y_test), np.array(y_pred_lr), class_names)
## Plot all grid search results
if gridsrch:
print("Plot the gridsearch results for Logistic Regression")
results = pd.DataFrame(grid.cv_results_)
fig, axs = plt.subplots(2, 3, figsize=(15, 10))
# Plot mean test score as a function of C
axs[0, 0].plot(results['param_C'], results['mean_test_score'])
axs[0, 0].set_xlabel('C')
axs[0, 0].set_ylabel('Mean Test Score')
axs[0, 0].set_title('Logistic Regression Hyperparameter Tuning')
# Plot mean test score as a function of penalty
sns.boxplot(x='param_penalty', y='mean_test_score', data=results, ax=axs[0, 1])
axs[0, 1].set_xlabel('Penalty')
axs[0, 1].set_ylabel('Mean Test Score')
axs[0, 1].set_title('Logistic Regression Hyperparameter Tuning')
# Plot mean test score as a function of solver
sns.boxplot(x='param_solver', y='mean_test_score', data=results, ax=axs[0, 2])
axs[0, 2].set_xlabel('Solver')
axs[0, 2].set_ylabel('Mean Test Score')
axs[0, 2].set_title('Logistic Regression Hyperparameter Tuning')
# Plot mean test score as a function of max_iter
sns.boxplot(x='param_max_iter', y='mean_test_score', data=results, ax=axs[1, 0])
axs[1, 0].set_xlabel('Max Iterations')
axs[1, 0].set_ylabel('Mean Test Score')
axs[1, 0].set_title('Logistic Regression Hyperparameter Tuning')
# Plot mean test score as a function of multi_class
sns.boxplot(x='param_multi_class', y='mean_test_score', data=results, ax=axs[1, 1])
axs[1, 1].set_xlabel('Multi Class')
axs[1, 1].set_ylabel('Mean Test Score')
axs[1, 1].set_title('Logistic Regression Hyperparameter Tuning')
plt.tight_layout()
plt.show()
print()
print(f"Cumulative Time taken in seconds was {time.time() - start}")
if 'SGD' in optlist:
print()
print("Training with SGD Classifier..")
if gridsrch:
# SGD Classifier with logistic regression loss
clf = SGDClassifier(loss='log_loss', max_iter=300)
param_grid = {
'alpha': [0.01, 0.1, 1, 0.001], #, 10, 100
'penalty': ['l2'], #, 'l1', 'elasticnet'
'l1_ratio': [0.15], #np.linspace(0, 1, num=5),
'max_iter': [1500], #1000, 10000
'tol': np.logspace(-6, -3, num=3),
'learning_rate': ['optimal'], #, 'invscaling', 'adaptive'
'eta0': np.logspace(-5, -1, num=3),
'power_t': np.linspace(0.1, 1.0, num=5)
}
#param_grid_iterator = tqdm(list(ParameterGrid(param_grid)))
grid = GridSearchCV(clf, param_grid, cv = 3, n_jobs=-1, verbose = 15)
grid.fit(X_train_imputed, y_train)
y_pred_clf = grid.predict(X_test_imputed)
y_pred_clf_val = grid.predict(X_val_imputed)
y_pred_clf_train = grid.predict(X_train_imputed)
best_params = grid.best_params_
# Print the best parameters
print("Best parameters: ", best_params)
else:
clf = SGDClassifier(loss='log_loss', max_iter=1000, alpha= 1/100)
clf.fit(X_train_imputed, y_train)
y_pred_clf = clf.predict(X_test_imputed)
y_pred_clf_val = clf.predict(X_val_imputed)
y_pred_clf_train = clf.predict(X_train_imputed)
print()
print('SGD Classifier with logistic regression loss:')
print()
print('Accuracy train:', accuracy_score(y_train, y_pred_clf_train))
print('Accuracy test:', accuracy_score(y_test, y_pred_clf))
print('Accuracy val:', accuracy_score(y_val, y_pred_clf_val))
try:
# compute the predicted labels
if gridsrch:
y_pred = np.argmax(grid.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(grid.predict_proba(X_val_imputed), axis=1)
else:
y_pred = np.argmax(clf.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(clf.predict_proba(X_val_imputed), axis=1)
if len(np.unique(y_test)) < 2 or len(np.unique(y_pred)) < 2 or \
len(np.unique(y_val)) < 2 or len(np.unique(y_pred_v)) < 2:
print("ROC AUC score may not be well-defined when not all classes \
are present in y_test or y_pred.")
else:
# compute the ROC AUC score
"""y_pred_proba = lr.predict_proba(X_test_imputed)
auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')"""
if gridsrch:
class_index = list(grid.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test, grid.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, grid.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(grid.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(grid.predict_proba(X_val_imputed)[:, class_index]), class_names)
else:
class_index = list(clf.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test, clf.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, clf.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(clf.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(clf.predict_proba(X_val_imputed)[:, class_index]), class_names)
except Exception as e:
print("From the train_and_evaluate function & SGD section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
print('Precision test:', precision_score(y_test, y_pred_clf, average='weighted'))
print('Recall test:', recall_score(y_test, y_pred_clf, average='weighted'))
print('Precision val:', precision_score(y_val, y_pred_clf_val, average='weighted'))
print('Recall val:', recall_score(y_val, y_pred_clf_val, average='weighted'))
try:
if 'PNEUMONIA' in np.unique(y_pred_clf_val): #np.unique(y_val) == 7 and
print()
print("Printing how this model fares for class PNEUMONIA")
evaluate_class(y_pred_clf_val, y_val, 'PNEUMONIA', class_names)
except Exception as e:
print("From the train_and_evaluate function & SGD section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
print()
#plot_roc_curve(np.array(y_test), np.array(clf.predict_proba(X_test_imputed)), class_names)
plot_confusion_matrix(np.array(y_test), np.array(y_pred_clf), class_names)
print()
if gridsrch:
print("Plot the gridsearch results for SGD")
print()
results = pd.DataFrame(grid.cv_results_)
fig, axs = plt.subplots(4, 2, figsize=(15, 20))
# Plot mean test score as a function of alpha
axs[0, 0].plot(results['param_alpha'], results['mean_test_score'])
axs[0, 0].set_xlabel('Alpha')
axs[0, 0].set_ylabel('Mean Test Score')
axs[0, 0].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of penalty
sns.boxplot(x='param_penalty', y='mean_test_score', data=results, ax=axs[0, 1])
axs[0, 1].set_xlabel('Penalty')
axs[0, 1].set_ylabel('Mean Test Score')
axs[0, 1].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of l1_ratio
sns.boxplot(x='param_l1_ratio', y='mean_test_score', data=results, ax=axs[1, 0])
axs[1, 0].set_xlabel('L1 Ratio')
axs[1, 0].set_ylabel('Mean Test Score')
axs[1, 0].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of max_iter
sns.boxplot(x='param_max_iter', y='mean_test_score', data=results, ax=axs[1, 1])
axs[1, 1].set_xlabel('Max Iterations')
axs[1, 1].set_ylabel('Mean Test Score')
axs[1, 1].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of tol
sns.boxplot(x='param_tol', y='mean_test_score', data=results, ax=axs[2, 0])
axs[2, 0].set_xlabel('Tolerance')
axs[2, 0].set_ylabel('Mean Test Score')
axs[2, 0].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of learning_rate
sns.boxplot(x='param_learning_rate', y='mean_test_score', data=results, ax=axs[2, 1])
axs[2, 1].set_xlabel('Learning Rate')
axs[2, 1].set_ylabel('Mean Test Score')
axs[2, 1].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of eta0
sns.boxplot(x='param_eta0', y='mean_test_score', data=results, ax=axs[3, 0])
axs[3, 0].set_xlabel('Eta0')
axs[3, 0].set_ylabel('Mean Test Score')
axs[3, 0].set_title('SGD Classifier Hyperparameter Tuning')
# Plot mean test score as a function of power_t
sns.boxplot(x='param_power_t', y='mean_test_score', data=results, ax=axs[3, 1])
axs[3, 1].set_xlabel('Power T')
axs[3, 1].set_ylabel('Mean Test Score')
axs[3, 1].set_title('SGD Classifier Hyperparameter Tuning')
plt.tight_layout()
plt.show()
print()
print(f"Cumulative Time taken in seconds was {time.time() - start}")
print()
if 'SVM' in optlist:
print()
print("Training with SVM/SVC..")
if gridsrch:
svm = SVC(probability=True)
param_grid = {
'C': [0.1, 1, 10, 100],
'kernel': ['rbf', 'sigmoid'], #'linear', 'poly',
#'degree': [1, 2, 3, 4],
'gamma': ['scale', 'auto', 0.001, 1], #+ list(np.logspace(-5, 3, num=5)),
'coef0': np.linspace(-1, 1, num=11),
'shrinking': [True], #, False
'probability': [True],
'tol': np.logspace(-6, -3, num=3)
}
grid = GridSearchCV(svm, param_grid, cv=5, n_jobs=-1, verbose =3)
grid.fit(X_train_imputed, y_train)
y_pred_svm = grid.predict(X_test_imputed)
y_pred_svm_val = grid.predict(X_val_imputed)
y_pred_svm_train = grid.predict(X_train_imputed)
best_params = grid.best_params_
# Print the best parameters
print("Best parameters: ", best_params)
else:
# Support Vector Machines
svm = SVC(kernel='linear', probability=True)
svm.fit(X_train_imputed, y_train)
y_pred_svm = svm.predict(X_test_imputed)
y_pred_svm_val = svm.predict(X_val_imputed)
print()
y_pred_svm_train = svm.predict(X_train_imputed)
"""svm = LinearSVC(dual=False)
svm.fit(X_train_imputed, y_train)
y_pred_svm = svm.predict(X_test_imputed)
y_pred_svm_val = svm.predict(X_val_imputed)
y_pred_svm_train = svm.predict(X_train_imputed)"""
print()
# Support Vector Machines
print('Support Vector Machines:')
print('Accuracy train:', accuracy_score(y_train, y_pred_svm_train))
print('Accuracy test:', accuracy_score(y_test, y_pred_svm))
print('Accuracy val:', accuracy_score(y_val, y_pred_svm_val))
try:
# compute the predicted labels
if gridsrch:
y_pred = np.argmax(grid.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(grid.predict_proba(X_val_imputed), axis=1)
else:
y_pred = np.argmax(svm.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(svm.predict_proba(X_val_imputed), axis=1)
if len(np.unique(y_test)) < 2 or len(np.unique(y_pred)) < 2 or \
len(np.unique(y_val)) < 2 or len(np.unique(y_pred_v)) < 2:
print("ROC AUC score may not be well-defined when not all classes \
are present in y_test or y_pred.")
else:
# compute the ROC AUC score
if gridsrch:
class_index = list(grid.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test, grid.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, grid.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(grid.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(grid.predict_proba(X_val_imputed)[:, class_index]), class_names)
else:
class_index = list(svm.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test, svm.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, svm.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(svm.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(svm.predict_proba(X_val_imputed)[:, class_index]), class_names)
except Exception as e:
print("From the train_and_evaluate function & SVM section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
print('Precision test:', precision_score(y_test, y_pred_svm, average='weighted'))
print('Recall test:', recall_score(y_test, y_pred_svm, average='weighted'))
print('Precision val:', precision_score(y_val, y_pred_svm_val, average='weighted'))
print('Recall val:', recall_score(y_val, y_pred_svm_val, average='weighted'))
try:
if 'PNEUMONIA' in np.unique(y_pred_svm_val): #np.unique(y_val) == 7 and
print()
print("Printing how this model fares for class PNEUMONIA")
evaluate_class(y_pred_svm_val, y_val, 'PNEUMONIA', class_names)
except Exception as e:
print(str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
plot_confusion_matrix(np.array(y_test), np.array(y_pred_svm), class_names)
if gridsrch:
print()
print("Plot the gridsearch results for SVM/SVC")
results = pd.DataFrame(grid.cv_results_)
fig, axs = plt.subplots(3, 2, figsize=(15, 15))
# Plot mean test score as a function of C
axs[0, 0].plot(results['param_C'], results['mean_test_score'])
axs[0, 0].set_xlabel('C')
axs[0, 0].set_ylabel('Mean Test Score')
axs[0, 0].set_title('SVM Hyperparameter Tuning')
# Plot mean test score as a function of kernel
sns.boxplot(x='param_kernel', y='mean_test_score', data=results, ax=axs[0, 1])
axs[0, 1].set_xlabel('Kernel')
axs[0, 1].set_ylabel('Mean Test Score')
axs[0, 1].set_title('SVM Hyperparameter Tuning')
# Plot mean test score as a function of gamma
sns.boxplot(x='param_gamma', y='mean_test_score', data=results, ax=axs[1, 0])
axs[1, 0].set_xlabel('Gamma')
axs[1, 0].set_ylabel('Mean Test Score')
axs[1, 0].set_title('SVM Hyperparameter Tuning')
# Plot mean test score as a function of coef0
sns.boxplot(x='param_coef0', y='mean_test_score', data=results, ax=axs[1, 1])
axs[1, 1].set_xlabel('Coef0')
axs[1, 1].set_ylabel('Mean Test Score')
axs[1, 1].set_title('SVM Hyperparameter Tuning')
# Plot mean test score as a function of shrinking
sns.boxplot(x='param_shrinking', y='mean_test_score', data=results, ax=axs[2, 0])
axs[2, 0].set_xlabel('Shrinking')
axs[2, 0].set_ylabel('Mean Test Score')
axs[2, 0].set_title('SVM Hyperparameter Tuning')
# Plot mean test score as a function of probability
sns.boxplot(x='param_probability', y='mean_test_score', data=results, ax=axs[2, 1])
axs[2, 1].set_xlabel('Probability')
axs[2, 1].set_ylabel('Mean Test Score')
axs[2, 1].set_title('SVM Hyperparameter Tuning')
plt.tight_layout()
plt.show()
print()
print(f"Cumulative Time taken in seconds was {time.time() - start}")
if 'RF' in optlist:
print()
print("Training with Random Forest..")
if gridsrch:
rf = RandomForestClassifier()
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [30, 50, 200, None],
'min_samples_split': [2, 4, 8],
'min_samples_leaf': [1, 2, 8],
'max_features': ['sqrt', 'log2']
}
grid = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1, verbose =1)
grid.fit(X_train_imputed, y_train)
y_pred_rf = grid.predict(X_test_imputed)
y_pred_rf_val = grid.predict(X_val_imputed)
y_pred_rf_train = grid.predict(X_train_imputed)
best_params = grid.best_params_
# Print the best parameters
print("Best parameters: ", best_params)
else:
# Random Forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train_imputed, y_train)
y_pred_rf = rf.predict(X_test_imputed)
y_pred_rf_val = rf.predict(X_val_imputed)
y_pred_rf_train = rf.predict(X_train_imputed)
print()
# Random Forest
print('Random Forest:')
print('Accuracy train:', accuracy_score(y_train, y_pred_rf_train))
print('Accuracy test:', accuracy_score(y_test, y_pred_rf))
print('Accuracy val:', accuracy_score(y_val, y_pred_rf_val))
try:
# compute the predicted labels
if gridsrch:
y_pred = np.argmax(grid.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(grid.predict_proba(X_val_imputed), axis=1)
else:
y_pred = np.argmax(rf.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(rf.predict_proba(X_val_imputed), axis=1)
if len(np.unique(y_test)) < 2 or len(np.unique(y_pred)) < 2 or \
len(np.unique(y_val)) < 2 or len(np.unique(y_pred_v)) < 2:
print("ROC AUC score may not be well-defined when not all classes \
are present in y_test or y_pred.")
else:
# compute the ROC AUC score
"""y_pred_proba = lr.predict_proba(X_test_imputed)
auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')"""
if gridsrch:
class_index = list(grid.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test,
grid.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val,
grid.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(grid.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(grid.predict_proba(X_val_imputed)[:, class_index]), class_names)
else:
class_index = list(rf.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test, rf.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, rf.predict_proba(X_val_imputed)[:, class_index]))
plot_roc_curve(np.array(y_test), np.array(rf.predict_proba(X_test_imputed)[:, class_index]), class_names)
plot_roc_curve(np.array(y_val), np.array(rf.predict_proba(X_val_imputed)[:, class_index]), class_names)
except Exception as e:
print("From the train_and_evaluate function & RF section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
print('Precision test:', precision_score(y_test, y_pred_rf, average='weighted'))
print('Recall test:', recall_score(y_test, y_pred_rf, average='weighted'))
print('Precision val:', precision_score(y_val, y_pred_rf_val, average='weighted'))
print('Recall val:', recall_score(y_val, y_pred_rf_val, average='weighted'))
try:
if 'PNEUMONIA' in np.unique(y_pred_rf_val): #np.unique(y_val) == 7 and
print()
print("Printing how this model fares for class PNEUMONIA")
evaluate_class(y_pred_rf_val, y_val, 'PNEUMONIA', class_names)
else:
print("Class PNEUMONIA not found in the list")
except Exception as e:
print(str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
if gridsrch:
print("Plot the gridsearch results for Random Forest")
print()
results = pd.DataFrame(grid.cv_results_)
fig, axs = plt.subplots(3, 2, figsize=(15, 15))
# Plot mean test score as a function of n_estimators
axs[0, 0].plot(results['param_n_estimators'], results['mean_test_score'])
axs[0, 0].set_xlabel('Number of Estimators')
axs[0, 0].set_ylabel('Mean Test Score')
axs[0, 0].set_title('Random Forest Hyperparameter Tuning')
# Plot mean test score as a function of max_depth
sns.boxplot(x='param_max_depth', y='mean_test_score', data=results, ax=axs[0, 1])
axs[0, 1].set_xlabel('Max Depth')
axs[0, 1].set_ylabel('Mean Test Score')
axs[0, 1].set_title('Random Forest Hyperparameter Tuning')
# Plot mean test score as a function of min_samples_split
sns.boxplot(x='param_min_samples_split', y='mean_test_score', data=results, ax=axs[1, 0])
axs[1, 0].set_xlabel('Min Samples Split')
axs[1, 0].set_ylabel('Mean Test Score')
axs[1, 0].set_title('Random Forest Hyperparameter Tuning')
# Plot mean test score as a function of min_samples_leaf
sns.boxplot(x='param_min_samples_leaf', y='mean_test_score', data=results, ax=axs[1, 1])
axs[1, 1].set_xlabel('Min Samples Leaf')
axs[1, 1].set_ylabel('Mean Test Score')
axs[1, 1].set_title('Random Forest Hyperparameter Tuning')
# Plot mean test score as a function of max_features
sns.boxplot(x='param_max_features', y='mean_test_score', data=results, ax=axs[2, 0])
axs[2, 0].set_xlabel('Max Features')
axs[2, 0].set_ylabel('Mean Test Score')
axs[2, 0].set_title('Random Forest Hyperparameter Tuning')
plt.tight_layout()
plt.show()
print()
print(f"Cumulative Time taken in seconds was {time.time() - start}")
if 'NB' in optlist:
print()
print("Training with Gaussian Naive bayes..")
if gridsrch:
param_grid = {'var_smoothing': np.logspace(0,-9, num=10)} ##100
# Create a GridSearchCV object
grid_search = GridSearchCV(GaussianNB(), param_grid, cv=5, n_jobs=-1, verbose =1)
# Fit the GridSearchCV object to the data
grid_search.fit(X_train_imputed, y_train)
# Print the best parameters and the best score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
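# Use the best estimator to make predictions on the test and validation sets
# (nb is not defined in this branch, so predict with the fitted grid search)
y_pred_nb = grid_search.best_estimator_.predict(X_test_imputed)
y_pred_nb_val = grid_search.best_estimator_.predict(X_val_imputed)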
else:
# Naive Bayes
nb = GaussianNB()
nb.fit(X_train_imputed, y_train)
y_pred_nb = nb.predict(X_test_imputed)
y_pred_nb_val = nb.predict(X_val_imputed)
print()
# Naive Bayes
print('Naive Bayes:')
print('Accuracy test:', accuracy_score(y_test, y_pred_nb))
print('Accuracy val:', accuracy_score(y_val, y_pred_nb_val))
try:
# compute the predicted labels
if gridsrch:
y_pred = np.argmax(grid_search.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(grid_search.predict_proba(X_val_imputed), axis=1)
else:
y_pred = np.argmax(nb.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(nb.predict_proba(X_val_imputed), axis=1)
if len(np.unique(y_test)) < 2 or len(np.unique(y_pred)) < 2 or \
len(np.unique(y_val)) < 2 or len(np.unique(y_pred_v)) < 2:
print("ROC AUC score may not be well-defined when not all classes \
are present in y_test or y_pred.")
else:
# compute the ROC AUC score
"""y_pred_proba = lr.predict_proba(X_test_imputed)
auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr')"""
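# Score the PNEUMONIA probability column with the Naive Bayes model itself
# (the earlier version referenced grid and rf, which belong to other sections)
if gridsrch:
class_index = list(grid_search.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test,
grid_search.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val,
grid_search.predict_proba(X_val_imputed)[:, class_index]))
else:
class_index = list(nb.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test, nb.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val, nb.predict_proba(X_val_imputed)[:, class_index]))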
except Exception as e:
print("From the train_and_evaluate function & GNB section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
print('Precision test:', precision_score(y_test, y_pred_nb, average='weighted'))
print('Recall test:', recall_score(y_test, y_pred_nb, average='weighted'))
print('Precision val:', precision_score(y_val, y_pred_nb_val, average='weighted'))
print('Recall val:', recall_score(y_val, y_pred_nb_val, average='weighted'))
try:
if 'PNEUMONIA' in np.unique(y_pred_nb_val): #np.unique(y_val) == 7 and
print()
print("Printing how this model fares for class PNEUMONIA")
evaluate_class(y_pred_nb_val, y_val, 'PNEUMONIA', class_names)
except Exception as e:
print(str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
if gridsrch:
print("Plot the gridsearch results for Naive Bayes")
results = pd.DataFrame(grid_search.cv_results_)
plt.plot(results['param_var_smoothing'], results['mean_test_score'])
plt.xscale('log')
plt.xlabel('Var Smoothing')
plt.ylabel('Mean Test Score')
plt.title('Gaussian Naive Bayes Hyperparameter Tuning')
plt.show()
print()
print(f"Cumulative Time taken was {time.time() - start}")
if 'KNN' in optlist:
print()
if gridsrch:
print("Training with KNN..")
# k-Nearest Neighbors
knn = KNeighborsClassifier()
param_grid = {
'n_neighbors': [10, 30],
'weights': ['distance'],
'algorithm': ['kd_tree'],
'leaf_size': [20],
'p': [2],
'metric': ['minkowski']
}
"""param_grid = {
'n_neighbors': [10, 30], #3, 5, 7, 9, 12,
'weights': ['distance'], #'uniform',
'algorithm': ['kd_tree', 'brute','ball_tree'], #'auto','kd_tree',, 'brute'
'leaf_size': [20, 50, 100, 200], #10, 20,, 40, 50
'p': [2],#1,
'metric': ['minkowski', 'euclidean', 'manhattan'] #
}"""
grid = GridSearchCV(knn, param_grid, cv=5, n_jobs=-1, verbose = 15)
grid.fit(X_train_imputed, y_train)
y_pred_knn = grid.predict(X_test_imputed)
y_pred_knn_val = grid.predict(X_val_imputed)
y_pred_knn_train = grid.predict(X_train_imputed)
print()
print('k-Nearest Neighbors:')
print('Best parameters:', grid.best_params_)
print('Accuracy train:', accuracy_score(y_train, y_pred_knn_train))
print('Accuracy test:', accuracy_score(y_test, y_pred_knn))
print('Accuracy val:', accuracy_score(y_val, y_pred_knn_val))
print('Precision test:', precision_score(y_test, y_pred_knn, average='weighted', zero_division=0))
print('Recall test:', recall_score(y_test, y_pred_knn, average='weighted'))
print('Precision val:', precision_score(y_val, y_pred_knn_val, average='weighted', zero_division=0))
print('Recall val:', recall_score(y_val, y_pred_knn_val, average='weighted'))
try:
# compute the predicted labels
y_pred = np.argmax(grid.predict_proba(X_test_imputed), axis=1)
y_pred_v = np.argmax(grid.predict_proba(X_val_imputed), axis=1)
if len(np.unique(y_test)) < 2 or len(np.unique(y_pred)) < 2 or \
len(np.unique(y_val)) < 2 or len(np.unique(y_pred_v)) < 2:
print("ROC AUC score may not be well-defined when not all classes \
are present in y_test or y_pred.")
else:
# compute the ROC AUC score
class_index = list(grid.classes_).index('PNEUMONIA')
print('AUC ROC test:', roc_auc_score(y_test,
grid.predict_proba(X_test_imputed)[:, class_index]))
print('AUC ROC val:', roc_auc_score(y_val,
grid.predict_proba(X_val_imputed)[:, class_index]))
except Exception as e:
print("From the train_and_evaluate function & KNN section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
try:
if 'PNEUMONIA' in np.unique(y_pred_knn_val): #np.unique(y_val) == 7 and
print()
print("Printing how this model fares for class PNEUMONIA")
evaluate_class(y_pred_knn_val, y_val, 'PNEUMONIA', class_names)
except Exception as e:
print("From the train_and_evaluate function & KNN section",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
# Plot the results
results = pd.DataFrame(grid.cv_results_)
fig, axs = plt.subplots(3, 2, figsize=(15, 15))
# Plot mean test score as a function of n_neighbors
axs[0, 0].plot(results['param_n_neighbors'], results['mean_test_score'])
axs[0, 0].set_xlabel('Number of Neighbors')
axs[0, 0].set_ylabel('Mean Test Score')
axs[0, 0].set_title('KNN Hyperparameter Tuning')
# Plot mean test score as a function of weights
sns.boxplot(x='param_weights', y='mean_test_score', data=results, ax=axs[0, 1])
axs[0, 1].set_xlabel('Weights')
axs[0, 1].set_ylabel('Mean Test Score')
axs[0, 1].set_title('KNN Hyperparameter Tuning')
# Plot mean test score as a function of algorithm
sns.boxplot(x='param_algorithm', y='mean_test_score', data=results, ax=axs[1, 0])
axs[1, 0].set_xlabel('Algorithm')
axs[1, 0].set_ylabel('Mean Test Score')
axs[1, 0].set_title('KNN Hyperparameter Tuning')
# Plot mean test score as a function of leaf_size
sns.boxplot(x='param_leaf_size', y='mean_test_score', data=results, ax=axs[1, 1])
axs[1, 1].set_xlabel('Leaf Size')
axs[1, 1].set_ylabel('Mean Test Score')
axs[1, 1].set_title('KNN Hyperparameter Tuning')
# Plot mean test score as a function of p
sns.boxplot(x='param_p', y='mean_test_score', data=results, ax=axs[2, 0])
axs[2, 0].set_xlabel('P')
axs[2, 0].set_ylabel('Mean Test Score')
axs[2, 0].set_title('KNN Hyperparameter Tuning')
# Plot mean test score as a function of metric
sns.boxplot(x='param_metric', y='mean_test_score', data=results, ax=axs[2, 1])
axs[2, 1].set_xlabel('Metric')
axs[2, 1].set_ylabel('Mean Test Score')
axs[2, 1].set_title('KNN Hyperparameter Tuning')
plt.tight_layout()
plt.show()
print()
print(f"Cumulative Time taken was {time.time() - start}")
print()
I decided to use feature extraction, followed by ML training on the extracted features, to build models for this binary image classification problem.
We first ensure class balance by selecting an equal number of images per class, and then extract sophisticated features from the images.
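The class-balancing step can be sketched roughly as follows. This is a minimal illustration, not the notebook's own loader: the helper name `sample_balanced_paths`, the `seed` argument, and the assumption that `main_folder` contains one sub-folder per class are all hypothetical.

```python
import random
from pathlib import Path


def sample_balanced_paths(main_folder, class_names, num_images_per_class, seed=42):
    """Return (paths, labels) with the same number of images for every class.

    Hypothetical helper: assumes one sub-folder per class under main_folder.
    """
    rng = random.Random(seed)
    paths, labels = [], []
    for class_name in class_names:
        class_dir = Path(main_folder) / class_name
        # Sort first so the sample is reproducible for a given seed.
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        if len(files) < num_images_per_class:
            raise ValueError(
                f"{class_name} has only {len(files)} images, "
                f"need {num_images_per_class}"
            )
        chosen = rng.sample(files, num_images_per_class)
        paths.extend(str(p) for p in chosen)
        labels.extend([class_name] * num_images_per_class)
    return paths, labels
```

Raising an error when a class has too few images, instead of silently sampling fewer, keeps the balance guarantee explicit.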
Next, we run the features through two distinct approaches, PCA-based and non-PCA-based dimensionality and noise reduction, and then apply multiple types of machine learning optimization algorithms.
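For the non-PCA route, one plausible reading of "Variance & K Best" is scikit-learn's `VarianceThreshold` followed by `SelectKBest`. The sketch below works under that assumption; the helper name `simplify_features`, the `f_classif` scoring function, and the synthetic data are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif


def simplify_features(X_train, y_train, X_other, pct=0.5, var_threshold=0.0):
    """Drop constant columns, then keep the top pct of columns by ANOVA F-score."""
    # Fit the selectors on training data only, so no test/val information leaks in.
    vt = VarianceThreshold(threshold=var_threshold)
    X_train_v = vt.fit_transform(X_train)
    X_other_v = vt.transform(X_other)

    # pct is treated as a fraction of the columns that survive the variance filter.
    k = max(1, int(pct * X_train_v.shape[1]))
    kbest = SelectKBest(score_func=f_classif, k=k)
    X_train_k = kbest.fit_transform(X_train_v, y_train)
    X_other_k = kbest.transform(X_other_v)
    return X_train_k, X_other_k


# Synthetic stand-in for extracted image features (assumption, not real data).
rng = np.random.default_rng(0)
y = np.array(["NORMAL"] * 20 + ["PNEUMONIA"] * 20)
X = rng.normal(size=(40, 10))
X[:, 0] = 1.0                    # constant column, removed by the variance filter
X[y == "PNEUMONIA", 1] += 3.0    # informative column
X_tr, X_te = simplify_features(X[:30], y[:30], X[30:], pct=0.5)
print(X_tr.shape, X_te.shape)
```

The same fitted selectors are applied to the held-out split so that train and test keep identical columns.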
This section contains two function definitions:
mlfeaturizationandtraining()
and
PCAbasedanalysis()
The first generates large, complex features such as the combination features. The second applies PCA reduction to the embedding features and more. PCA is not used much in this project because it did not perform well; learning how to make it perform is a post-project task. Please see the PCA results towards the end.
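As a rough illustration of the PCA route, the sketch below picks the number of components from the cumulative explained variance. It is a sketch under stated assumptions rather than the body of PCAbasedanalysis(): the helper name `pca_reduce`, the 0.95 threshold, the standardization step, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_reduce(X_train, X_other, explained_variance=0.95):
    """Project both splits onto the components that reach the variance target."""
    # Fit scaler and PCA on training data only, then reuse them for the other split.
    scaler = StandardScaler().fit(X_train)
    # A float n_components in (0, 1) keeps just enough components to explain
    # that fraction of the variance.
    pca = PCA(n_components=explained_variance, svd_solver="full")
    X_train_p = pca.fit_transform(scaler.transform(X_train))
    X_other_p = pca.transform(scaler.transform(X_other))
    return X_train_p, X_other_p, pca


# Synthetic embedding-like features driven by 3 latent factors (assumption).
rng = np.random.default_rng(0)
latent = rng.normal(size=(60, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(60, 50))
X_train_p, X_val_p, pca = pca_reduce(X[:45], X[45:], explained_variance=0.95)
print(pca.n_components_, X_train_p.shape, X_val_p.shape)
```

Reusing the fitted `pca` object for every split is what keeps the number of components identical across train, test, and validation.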
The parameters passed to the mlfeaturizationandtraining function are:
num_images_per_class: Number of samples per class; it must be the same for every class to ensure class balance
main_folder: Main folder path where the dataset is
parent_folder: Parent folder path where the features are saved as .npy
gridsrch: Whether or not to do grid search
load: Whether to load data from .npy files; these are the extracted features
tne: Whether to do training and evaluation or just extract features
whichftr: Which features to run training and evaluation on: all means all, one means VGG16 with 224x224, two means Resnet50, three means VGG16 600x450, and four means the feature combo
optslist: Optimizers to use for training
simplifyfeatures: Whether to simplify the feature vector using Variance & K Best
pct: Percentage of features for K Best
color_space_clhce: Which color space to use for the Color Histogram
PCA FUNCTION
The parameters passed to the PCAbasedanalysis function are:
num_images_per_class: Number of samples per class; it must be the same for every class to ensure class balance
parent_folder: Parent folder path where the features are saved as .npy
k: Number of principal components
whichftr: Which of the 5 feature types to use
alg: Which algorithm/optimizer to use
screeapp: Whether to use PCA components based on Explained Variance Count
gridsrch: Whether to do grid search for hyperparameters
Strat: Whether to use stratified cross-validation
sen_y: Whether to use custom scoring in the GridSearchCV
objective_trial: Objective trials to find the best algorithm and HPs
In this function we extract various types of features:
block5_conv3
conv5_block3_out
C-Histogram + LBP + HOG + Contours + Edges
The parameters are:
X: The train, test, or val list of file paths
n_components_pca: No longer used, since I don't use PCA any more. It was used initially to ensure that the same number of components is used across train/test/val for the training/eval to work. Retained for future purposes.
color_space_clhce: Which color space to use for C-Histograms, since LAB & HSV have been seen to work well in terms of model performance
Here is a list of color spaces that we intend to support eventually. Currently, only BGR & LAB are supported.
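To make the color-histogram feature and the color-space choice more concrete, here is a minimal NumPy-only sketch. It is not the notebook's extraction code: the helper name `color_histogram`, the bin count, and the random stand-in image are assumptions. The function expects an image already in the chosen color space; converting a BGR image to LAB beforehand could be done with OpenCV's `cv2.cvtColor(img, cv2.COLOR_BGR2LAB)`, which is left out here to keep the example dependency-free.

```python
import numpy as np


def color_histogram(img, bins=32):
    """Concatenate normalised per-channel histograms of an H x W x 3 uint8 image."""
    if img.ndim != 3 or img.shape[2] != 3:
        raise ValueError("expected an H x W x 3 image")
    feats = []
    for channel in range(3):
        # Each channel gets its own histogram over the full 8-bit value range.
        hist, _ = np.histogram(img[:, :, channel], bins=bins, range=(0, 256))
        # Normalise so that images of different sizes give comparable features.
        feats.append(hist / hist.sum())
    return np.concatenate(feats)


# Random stand-in for a loaded BGR image (assumption, not a real X-ray).
rng = np.random.default_rng(0)
fake_bgr = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)
features = color_histogram(fake_bgr, bins=32)
print(features.shape)
```

Because each channel is normalised separately, the resulting vector has `3 * bins` entries regardless of the image size.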
def extract_features(X, n_components_pca = 0, color_space_clhce = 'BGR'):
"""
Extract PCA, HOG, CNN+PCA & Color Histogram, contour, edge features from the images
Note: One more step could be to use grayscale images along with color histogram
Also, we need to see if we can merge all these together and create one feature set instead
of 4.
IMPORTANT : What has not been done yet?
FINETUNING will be added in the future and may be more compute and memory intensive
Batching of feature extraction will be done in the future as well to ensure we can
further minimize the compute requirement.
"""
pca_cnn_features = []
cnn_features = []
pca_direct_features = []
color_hist_features = []
lbp_features = []
hog_features = []
contour_features = []
edge_features = []
try:
tf.config.threading.set_inter_op_parallelism_threads(32)
tf.config.threading.set_intra_op_parallelism_threads(32)
url = 'https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5'
file_path = './vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5'
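# Download the weights file; TLS certificate verification stays enabled
# and a failed download raises instead of being written to disk
response = requests.get(url, timeout=60)
response.raise_for_status()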
with open(file_path, 'wb') as f:
f.write(response.content)
# Load the VGG16 model with the downloaded weights
model = VGG16(weights=file_path, include_top=False)
## Extracting CNN features and then PCA from it
#model = VGG16(weights='imagenet', include_top=False)
print("==> Extracting CNN features with default image size without resizing") #Color Histogram &
for file_path in tqdm(X):
#print(file_path)
img = image.load_img(file_path, target_size=(800, 600))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = vgg16_preprocess_input(img_data)
cnn_feature = model.predict(img_data, verbose =0 )
cnn_features.append(cnn_feature.flatten())
print("Without Resizing VGG16 feature extraction completed", np.array(cnn_features).shape)
print()
## Extract as 224 x 224 since it tends to perform better from a specific layer
# Due to limitation of resources, we cannot test all possible layers that may be
# Good from the model metrics POV
cnn_features224_vgg = []
print("==> Extracting CNN 224 x 224 features with resizing")
"""#base_model = VGG16(weights='imagenet')
base_model = VGG16(weights= file_path)
model = Model(inputs=base_model.input, outputs=base_model.get_layer('block5_conv3').output)
#model = VGG16(weights='imagenet', include_top=False)"""
url = 'https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5'
file_path = './vgg16_weights_tf_dim_ordering_tf_kernels.h5'
# Download the weights file
response = requests.get(url, verify=False)
with open(file_path, 'wb') as f:
f.write(response.content)
#### IMPORTANT NOTE; The current approach does not do any finetuning on the VGG16 or Resnet50/Resnet101 models
### FINETUNING will be added in the future and may be more compute and memory intensive
# Load the pre-trained VGG16 model with the downloaded weights
base_model = VGG16(weights=file_path)
# Create a new model that outputs the features from the specified layer
model = Model(inputs=base_model.input, outputs=base_model.get_layer('block5_conv3').output)
for file_path in tqdm(X):
img = image.load_img(file_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = vgg16_preprocess_input(img_data)
cnn_feature224 = model.predict(img_data, verbose =0 )
cnn_features224_vgg.append(cnn_feature224.flatten())
print("With Resizing VGG16 feature extraction completed")
print()
print("==> Extracting Resnet50 224 x 224 features with resize")
# Load the pre-trained ResNet50 model
"""base_model = ResNet50(weights='imagenet')
# Specify the layer name from which you want to extract features
layer_name = 'conv5_block3_out'
# Create a new model that outputs the features from the specified layer
model = Model(inputs=base_model.input, outputs=base_model.get_layer(layer_name).output)"""
urlr = 'https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels.h5'
file_path_r = './resnet50_weights_tf_dim_ordering_tf_kernels.h5'
# Download the weights file
responser = requests.get(urlr, verify=False)
with open(file_path_r, 'wb') as fr:
fr.write(responser.content)
# Load the pre-trained ResNet50 model with the downloaded weights
base_model = ResNet50(weights=file_path_r)
# Specify the name of the layer from which to extract features
layer_name = 'conv5_block3_out'
# Create a new model that outputs the features from the specified layer
model = Model(inputs=base_model.input, outputs=base_model.get_layer(layer_name).output)
cnn_features_resnet = []
for file_path in tqdm(X):
# Load and preprocess the image
img = image.load_img(file_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)
img_data = resnet50_preprocess_input(img_data)
# Extract the CNN features
cnn_featurer = model.predict(img_data, verbose=0)
cnn_features_resnet.append(cnn_featurer.flatten())
print("With Resizing RESNET50 feature extraction completed")
print()
## KEEPING THIS CODE FOR FUTURE LEARNINGs
"""print("==> Extracting PCA from CNN with un-resized images")
print("Shape before PCA", np.array(cnn_features).shape)
# Transform the data using the chosen number of components
if n_components_pca == 0:
pca = PCA()
pca.fit(cnn_features)
# Calculate the cumulative explained variance
cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
print("Initial PCA number of components that are explaining 95% of variance is", n_components_pca)
print()
# Choose the number of components such that 95% of the total variance is retained
n_components_pca = np.where(cumulative_explained_variance >= 0.95)[0][0] + 1
print(f"Number of PCA components that retain 95% of the total explained variance for this set: \
{n_components_pca}")
print("==> Scaling CNN features")
scaler = StandardScaler()
cnn_features_scaled = scaler.fit_transform(cnn_features)
print("==> Fitting PCA to scaled CNN features")
pca = PCA(n_components=n_components_pca)
pca.fit(cnn_features_scaled)
# transform new data using the pre-fit PCA object
pca_cnn_features = pca.transform(cnn_features_scaled)
print("Shape after PCA", np.array(pca_cnn_features).shape)"""
print()
print("==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one")
## We need to extract features using the LAB color space & may be simplify this further.
st = time.time()
color_hist_features = []
lbp_features = []
hog_features = []
contour_features = []
edge_features = []
c = 0
for file_path in tqdm(X):
img = cv2.imread(file_path)
img = cv2.resize(img, (1200, 1500))
if color_space_clhce == 'BGR':
color_hist_feature = cv2.calcHist([img], [0, 1, 2], None,
[8, 8, 8], [0, 256, 0, 256, 0, 256])
elif color_space_clhce == 'LAB':
lab_img = cv2.cvtColor(img, cv2.COLOR_BGR2LAB) # Extracting from LAB color space
color_hist_feature = cv2.calcHist([lab_img], [0, 1, 2], None,
[8, 8, 8], [0, 256, 0, 256, 0, 256])
else:
print("The value of parameter color_space_clhce is not supported yet! Use BGR or LAB")
break
if not np.isnan(color_hist_feature).any():
color_hist_features.append(color_hist_feature.flatten())
else:
color_hist_features.append(np.zeros_like(color_hist_feature.flatten()))
#print("CHIST shape", np.array(color_hist_features).shape)
if len(img.shape) == 2:
gray_img = img
else:
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# LBP
lbp_feature = local_binary_pattern(gray_img, 8, 1)
lbp_feature = lbp_feature.T
if not np.isnan(lbp_feature).any():
lbp_features.append(lbp_feature.flatten())
else:
#print("LBP nan else")
lbp_features.append(np.zeros_like(lbp_feature.flatten()))
#print("LBP shapes", np.array(lbp_feature).shape, np.array(lbp_features[c].shape))
# HOG
hog_feature = hog(gray_img)
hog_feature = hog_feature.T
if not np.isnan(hog_feature).any():
hog_features.append(hog_feature)
else:
hog_features.append(np.zeros_like(hog_feature))
#print("HOG", np.array(hog_feature).shape, hog_features[c].shape)
try:
contours, _ = cv2.findContours(gray_img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
contour_area = [cv2.contourArea(contour) for contour in contours]
if len(contour_area) > 0:
contour_feature = np.array(contour_area).mean()
else:
contour_feature = 0
contour_features.append(contour_feature)
#print("Countour shapes", np.array(contour_feature).shape, np.array(contour_features[c].shape))
except Exception as e:
print(f"Error extracting contour feature for image {file_path}: {e}")
# Edge
try:
edge_img = cv2.Canny(gray_img, 100, 200)
if not np.isnan(edge_img).any():
edge_feature = edge_img.mean()
else:
edge_feature = 0
edge_features.append(edge_feature)
#print("Edge shapes", np.array(edge_feature).shape, np.array(edge_features[c].shape))
except Exception as e:
print(f"Error extracting edge feature for image {file_path}: {e}")
print("Image shape", gray_img.shape)
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
c += 1
print()
print(f"Cumulative time taken for extracting all complex features (seconds): {time.time() - st}")
## Combine all 5 types of features into one array
try:
print("Printing the shapes of the feature arrays before calling hstack")
print("CHIST", np.array(color_hist_features).shape)
print("LBP",np.array(lbp_features).shape)
print("HOG",np.array(hog_features).shape)
print("CONTOUR", np.array(contour_features).shape)
print("EDGE",np.array(edge_features).shape)
print()
color_hist_lbp_hog_contour_edge = np.hstack([color_hist_features, lbp_features, hog_features,
np.array(contour_features).reshape(-1, 1),
np.array(edge_features).reshape(-1, 1)])
except Exception as e:
print(np.array(color_hist_features).shape)
print(np.array(lbp_features).shape)
print(np.array(hog_features).shape)
print(np.array(contour_features).shape)
print(np.array(edge_features).shape)
print(f"Error stacking images using hstack: {e}")
print()
print(f"Cumulative Time taken so far for stacking all features (seconds): {time.time() - st}")
print()
print(f"Size of the CLHCE feature vector {color_hist_lbp_hog_contour_edge.shape}")
print()
except Exception as e:
print("Exception in extract_features", str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
#Unused :pca_direct_features,color_hist_features,color_hist_lbp_hog, hog_features
return cnn_features, color_hist_lbp_hog_contour_edge, \
cnn_features224_vgg, cnn_features_resnet
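The stacking step at the end of `extract_features` can be illustrated with a minimal sketch on synthetic arrays. Only the 8 x 8 x 8 = 512 histogram bins come from the code above; the other feature lengths are placeholders chosen for illustration and are not the sizes the real extractors produce.

```python
import numpy as np

n_images = 4
rng = np.random.default_rng(0)

# Placeholder per-image features (hypothetical lengths, except the 512 histogram bins).
color_hist_features = [rng.random(8 * 8 * 8) for _ in range(n_images)]
lbp_features = [rng.random(100) for _ in range(n_images)]
hog_features = [rng.random(50) for _ in range(n_images)]
contour_features = [rng.random() for _ in range(n_images)]  # one scalar per image
edge_features = [rng.random() for _ in range(n_images)]     # one scalar per image

# hstack needs every block to be 2-D with one row per image, so the scalar
# features are reshaped into single-column arrays first.
combined = np.hstack([
    np.array(color_hist_features),
    np.array(lbp_features),
    np.array(hog_features),
    np.array(contour_features).reshape(-1, 1),
    np.array(edge_features).reshape(-1, 1),
])
print(combined.shape)  # (4, 664)
```

If one extractor skips an image (for example, when the contour step raises an exception and nothing is appended), the row counts no longer match and `np.hstack` raises a `ValueError`, which is why the shapes are printed before stacking.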
Here we do feature extraction and training, depending on the parameters passed.
In this case, we use the Variance and K Best approach to select features before ML training.
This function calls the other functions, such as extract_features and train_and_evaluate, and is called directly from the
EXPERIMENTATION section.
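The Variance and K Best selection can be sketched in isolation as follows. This is a minimal sketch on synthetic data, not the notebook's own helper: `select_features` and the random arrays are hypothetical names introduced for illustration. Both selectors are fitted on the training split only and then reused on the other splits.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif


def select_features(X_train, y_train, X_others, pct=0.95):
    """Fit VarianceThreshold + SelectKBest on the training split only,
    then apply the same fitted selectors to every other split."""
    variance_selector = VarianceThreshold(threshold=0)
    X_train_var = variance_selector.fit_transform(X_train)

    # k is derived from the columns that survived the variance filter,
    # so it can never exceed the number of available features.
    k = max(1, int(X_train_var.shape[1] * pct))
    kbest_selector = SelectKBest(f_classif, k=k)
    X_train_sel = kbest_selector.fit_transform(X_train_var, y_train)

    others_sel = [kbest_selector.transform(variance_selector.transform(X))
                  for X in X_others]
    return X_train_sel, others_sel


# Synthetic stand-in data (hypothetical): 40 training samples, 10 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 10))
X_train[:, 3] = 1.0  # constant column, removed by VarianceThreshold
y_train = np.array([0, 1] * 20)
X_test = rng.normal(size=(8, 10))

X_train_sel, (X_test_sel,) = select_features(X_train, y_train, [X_test], pct=0.5)
print(X_train_sel.shape, X_test_sel.shape)  # (40, 4) (8, 4)
```

Deriving `k` from the post-variance column count is a design choice made in this sketch; it avoids asking `SelectKBest` for more features than remain after constant columns are dropped.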
def mlfeaturizationandtraining(num_images_per_class, main_folder, parent_folder, gridsrch, load,
tne = True, whichftr = 'all',optslist = 'LR,CLF,SVM,KNN,NB,RF',
simplifyfeatures = False, pct = 0.95, color_space_clhce = 'BGR'):
"""
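This function extracts features from the images (or loads previously saved ones), optionally reduces them
with Variance & K Best selection, and then trains and evaluates the classifiers listed in optslist via train_and_evaluate.
The feature sets are VGG16 CNN (800 x 600 and 224 x 224), ResNet50 CNN (224 x 224), and the combined Color Histogram/LBP/HOG/contour/edge vector.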
"""
subfolders = ['Training','Test', 'Validation']
class_names = ['NORMAL','PNEUMONIA']
#subfolders = ['Check']
#class_names = ['NORMAL'] #,'PNEUMONIA'
try:
tracemalloc.start()
# Create a directory to store the feature files
feature_dir = 'features' + str(num_images_per_class)
# Create the full path to the feature directory
feature_path = os.path.join(parent_folder, feature_dir)
if not load:
print(f"Will save the .npy files in {feature_path}")
if load:
print()
print("Loading saved features")
# Load the extracted features from files
if whichftr == 'all' or whichftr == 'three':
X_train_cnn = np.load(os.path.join(feature_path, 'X_train_cnn.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'four':
X_train_color_hist_lbp_hog = np.load(os.path.join(feature_path, 'X_train_color_hist_lbp_hog.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'one':
X_train_cnn_features224_vgg = np.load(os.path.join(feature_path, 'X_train_cnn_features224_vgg.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'two':
X_train_cnn_features_resnet = np.load(os.path.join(feature_path, 'X_train_cnn_features_resnet.npy')).astype(np.float32)
y_train = np.load(os.path.join(feature_path, 'Y_train.npy'))
if whichftr == 'all' or whichftr == 'three':
X_test_cnn = np.load(os.path.join(feature_path, 'X_test_cnn.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'four':
X_test_color_hist_lbp_hog = np.load(os.path.join(feature_path, 'X_test_color_hist_lbp_hog.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'one':
X_test_cnn_features224_vgg = np.load(os.path.join(feature_path, 'X_test_cnn_features224_vgg.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'two':
X_test_cnn_features_resnet = np.load(os.path.join(feature_path, 'X_test_cnn_features_resnet.npy')).astype(np.float32)
y_test = np.load(os.path.join(feature_path, 'Y_test.npy'))
if whichftr == 'all' or whichftr == 'three':
X_val_cnn = np.load(os.path.join(feature_path, 'X_val_cnn.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'four':
X_val_color_hist_lbp_hog = np.load(os.path.join(feature_path, 'X_val_color_hist_lbp_hog.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'one':
X_val_cnn_features224_vgg = np.load(os.path.join(feature_path, 'X_val_cnn_features224_vgg.npy')).astype(np.float32)
if whichftr == 'all' or whichftr == 'two':
X_val_cnn_features_resnet = np.load(os.path.join(feature_path, 'X_val_cnn_features_resnet.npy')).astype(np.float32)
y_val = np.load(os.path.join(feature_path, 'Y_val.npy'))
else:
X_train, y_train, X_test, y_test, X_val, y_val = get_data(main_folder, subfolders, class_names,
num_images_per_class)
print("Sizes of Train, Test & Validation arrays", len(X_train), len(X_test),len(X_val))
print()
start = time.time()
#n_components_pca = 0
n_components_pca = min(len(X_train), len(X_test), len(X_val))
## Saving the features
print()
print(f"\033[1m\033[4mExtracting features for Training\033[0m")
# Create the feature directory if it does not exist
if not os.path.exists(feature_path):
os.makedirs(feature_path)
print()
X_train_cnn, X_train_color_hist_lbp_hog, \
X_train_cnn_features224_vgg, X_train_cnn_features_resnet= extract_features(X_train,
n_components_pca,
color_space_clhce)
print()
print(f"\033[1m\033[4m===>Saving in {feature_path}\033[0m")
# Save the extracted features to files
np.save(os.path.join(feature_path, 'X_train_cnn.npy'), X_train_cnn)
np.save(os.path.join(feature_path, 'X_train_color_hist_lbp_hog.npy'), X_train_color_hist_lbp_hog)
np.save(os.path.join(feature_path, 'X_train_cnn_features224_vgg.npy'), X_train_cnn_features224_vgg)
np.save(os.path.join(feature_path, 'X_train_cnn_features_resnet.npy'), X_train_cnn_features_resnet)
np.save(os.path.join(feature_path, 'Y_train.npy'), y_train)
#print(X_train_pca2[:1])
print()
print(f"\033[1m\033[4mExtracting features for Testing\033[0m")
print()
#,X_test_pca_d, X_test_color_hist,X_test_hog
X_test_cnn, X_test_color_hist_lbp_hog, \
X_test_cnn_features224_vgg, X_test_cnn_features_resnet= extract_features(X_test,
n_components_pca,
color_space_clhce)
print()
print(f"\033[1m\033[4m===>Saving in {feature_path}\033[0m")
np.save(os.path.join(feature_path, 'X_test_cnn.npy'), X_test_cnn)
np.save(os.path.join(feature_path, 'X_test_color_hist_lbp_hog.npy'), X_test_color_hist_lbp_hog)
np.save(os.path.join(feature_path, 'X_test_cnn_features224_vgg.npy'), X_test_cnn_features224_vgg)
np.save(os.path.join(feature_path, 'X_test_cnn_features_resnet.npy'), X_test_cnn_features_resnet)
np.save(os.path.join(feature_path, 'Y_test.npy'), y_test)
print()
print(f"\033[1m\033[4mExtracting features for Validation\033[0m")
print()
X_val_cnn, X_val_color_hist_lbp_hog, \
X_val_cnn_features224_vgg, X_val_cnn_features_resnet= extract_features(X_val,
n_components_pca,
color_space_clhce)
print()
print(f"\033[1m\033[4m===>Saving in {feature_path}\033[0m")
np.save(os.path.join(feature_path, 'X_val_cnn.npy'), X_val_cnn)
np.save(os.path.join(feature_path, 'X_val_color_hist_lbp_hog.npy'), X_val_color_hist_lbp_hog)
np.save(os.path.join(feature_path, 'X_val_cnn_features224_vgg.npy'), X_val_cnn_features224_vgg)
np.save(os.path.join(feature_path, 'X_val_cnn_features_resnet.npy'), X_val_cnn_features_resnet)
np.save(os.path.join(feature_path, 'Y_val.npy'), y_val)
print()
print(f"Time taken for feature extraction {time.time() - start} seconds")
if tne: ## Runs if we want to train & evaluate
print(f"\033[1m\033[4mTrain & Evaluate section with a choice of 6 optimizers & 4 features ==>\033[0m")
print()
st = time.time()
print()
if whichftr == 'one' or whichftr == 'all':
if simplifyfeatures:
print("Doing feature selection using Variance & K Best")
variance_selector = VarianceThreshold(threshold = 0)
# Fit the variance_selector to the training data
variance_selector.fit(X_train_cnn_features224_vgg)
# Transform the training data
X_train_cnn_features224_vgg_selected = variance_selector.transform(X_train_cnn_features224_vgg)
print("Original feature count", X_train_cnn_features224_vgg.shape[1])
# Define the number of features to select
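k = min(int(X_train_cnn_features224_vgg.shape[1] * pct), X_train_cnn_features224_vgg_selected.shape[1]) # keep the pct fraction, capped at the features left after the variance filter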
# Create a SelectKBest object with the f_classif scoring function
kbest_selector = SelectKBest(f_classif, k=k)
# Fit the kbest_selector to the transformed training data
kbest_selector.fit(X_train_cnn_features224_vgg_selected, y_train)
# Transform the training, test, and validation sets
X_train_cnn_features224_vgg_selected = kbest_selector.transform(X_train_cnn_features224_vgg_selected)
X_test_cnn_features224_vgg_selected = kbest_selector.transform(variance_selector.transform(X_test_cnn_features224_vgg))
X_val_cnn_features224_vgg_selected = kbest_selector.transform(variance_selector.transform(X_val_cnn_features224_vgg))
print()
print(f"Cumulative Time taken: {time.time() - st}")
# Check the shape of the selected features
print("X_train selected shape:", X_train_cnn_features224_vgg_selected.shape)
print("X_test selected shape:", X_test_cnn_features224_vgg_selected.shape)
print("X_val selected shape:", X_val_cnn_features224_vgg_selected.shape)
train_and_evaluate(np.array(X_train_cnn_features224_vgg_selected).reshape(len(X_train_cnn_features224_vgg_selected), -1),
np.array(y_train),
np.array(X_test_cnn_features224_vgg_selected).reshape(len(X_test_cnn_features224_vgg_selected), -1),
np.array(y_test),
np.array(X_val_cnn_features224_vgg_selected).reshape(len(X_val_cnn_features224_vgg_selected), -1),
np.array(y_val),gridsrch,optslist)
print()
print(f"Cumulative Time taken: {time.time() - st}")
else:
print(f'\033[1m\033[4mVGG 224 CNN features with test/val dataset with \
{num_images_per_class} samples\033[0m')
train_and_evaluate(np.array(X_train_cnn_features224_vgg).reshape(len(X_train_cnn_features224_vgg), -1),
np.array(y_train),
np.array(X_test_cnn_features224_vgg).reshape(len(X_test_cnn_features224_vgg), -1),
np.array(y_test),
np.array(X_val_cnn_features224_vgg).reshape(len(X_val_cnn_features224_vgg), -1),
np.array(y_val),gridsrch,optslist)
print()
print(f"Cumulative Time taken: {time.time() - st}")
if whichftr == 'two' or whichftr == 'all':
if simplifyfeatures:
print("Doing feature selection using Variance & K Best")
variance_selector = VarianceThreshold(threshold = 0)
# Fit the variance_selector to the training data
variance_selector.fit(X_train_cnn_features_resnet)
# Transform the training data
X_train_cnn_features_resnet_selected = variance_selector.transform(X_train_cnn_features_resnet)
print("Original feature count", X_train_cnn_features_resnet.shape[1])
# Define the number of features to select
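k = min(int(X_train_cnn_features_resnet.shape[1] * pct), X_train_cnn_features_resnet_selected.shape[1]) # keep the pct fraction, capped at the features left after the variance filter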
# Create a SelectKBest object with the f_classif scoring function
kbest_selector = SelectKBest(f_classif, k=k)
# Fit the kbest_selector to the transformed training data
kbest_selector.fit(X_train_cnn_features_resnet_selected, y_train)
# Transform the training, test, and validation sets
X_train_cnn_features_resnet_selected = kbest_selector.transform(X_train_cnn_features_resnet_selected)
X_test_cnn_features_resnet_selected = kbest_selector.transform(variance_selector.transform(X_test_cnn_features_resnet))
X_val_cnn_features_resnet_selected = kbest_selector.transform(variance_selector.transform(X_val_cnn_features_resnet))
print()
print(f"Cumulative Time taken: {time.time() - st}")
# Check the shape of the selected features
print("X_train selected shape:", X_train_cnn_features_resnet_selected.shape)
print("X_test selected shape:", X_test_cnn_features_resnet_selected.shape)
print("X_val selected shape:", X_val_cnn_features_resnet_selected.shape)
train_and_evaluate(np.array(X_train_cnn_features_resnet_selected).reshape(len(X_train_cnn_features_resnet_selected), -1),
np.array(y_train),
np.array(X_test_cnn_features_resnet_selected).reshape(len(X_test_cnn_features_resnet_selected), -1),
np.array(y_test),
np.array(X_val_cnn_features_resnet_selected).reshape(len(X_val_cnn_features_resnet_selected), -1),
np.array(y_val),gridsrch,optslist)
print()
print(f"Cumulative Time taken: {time.time() - st}")
else:
print()
print(f'\033[1m\033[4mResnet 224 CNN features with test/val dataset with \
{num_images_per_class} samples\033[0m')
train_and_evaluate(np.array(X_train_cnn_features_resnet).reshape(len(X_train_cnn_features_resnet), -1),
np.array(y_train),
np.array(X_test_cnn_features_resnet).reshape(len(X_test_cnn_features_resnet), -1),
np.array(y_test),
np.array(X_val_cnn_features_resnet).reshape(len(X_val_cnn_features_resnet), -1),
np.array(y_val),gridsrch,optslist)
print()
print(f"Cumulative Time taken so far (seconds): {time.time() - st}")
print()
if whichftr == 'three' or whichftr == 'all':
if simplifyfeatures: # Whether to reduce feature space
print()
print("Doing feature selection using Variance & K Best")
variance_selector = VarianceThreshold(threshold = 0)
# Fit the variance_selector to the training data
variance_selector.fit(X_train_cnn)
# Transform the training data
X_train_cnn_selected = variance_selector.transform(X_train_cnn)
print("Original feature count", X_train_cnn.shape[1])
# Define the number of features to select
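k = min(int(X_train_cnn.shape[1] * pct), X_train_cnn_selected.shape[1]) # keep the pct fraction, capped at the features left after the variance filter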
# Create a SelectKBest object with the f_classif scoring function
kbest_selector = SelectKBest(f_classif, k=k)
# Fit the kbest_selector to the transformed training data
kbest_selector.fit(X_train_cnn_selected, y_train)
# Transform the training, test, and validation sets
X_train_cnn_selected = kbest_selector.transform(X_train_cnn_selected)
X_test_cnn_selected = kbest_selector.transform(variance_selector.transform(X_test_cnn))
X_val_cnn_selected = kbest_selector.transform(variance_selector.transform(X_val_cnn))
# Check the shape of the selected features
print("X_train selected shape:", X_train_cnn_selected.shape)
print("X_test selected shape:", X_test_cnn_selected.shape)
print("X_val selected shape:", X_val_cnn_selected.shape)
print(f'\033[1m\033[4mCNN features after Variance & K Best selection with test/val dataset with \
{num_images_per_class} samples\033[0m')
train_and_evaluate(np.array(X_train_cnn_selected).reshape(len(X_train_cnn_selected)
, -1),
np.array(y_train),
np.array(X_test_cnn_selected).reshape(len(X_test_cnn_selected)
, -1),
np.array(y_test),
np.array(X_val_cnn_selected).reshape(len(X_val_cnn_selected)
, -1),
np.array(y_val),gridsrch,optslist)
else:
print()
print(f'\033[1m\033[4mCNN 800 x 600 features with test/val dataset with \
{num_images_per_class} samples\033[0m')
train_and_evaluate(np.array(X_train_cnn).reshape(len(X_train_cnn), -1), np.array(y_train),
np.array(X_test_cnn).reshape(len(X_test_cnn), -1), np.array(y_test),
np.array(X_val_cnn).reshape(len(X_val_cnn), -1), np.array(y_val),
gridsrch,optslist)
print()
print(f"Cumulative Time taken so far (seconds): {time.time() - st}")
if whichftr == 'four' or whichftr == 'all':
if simplifyfeatures:
print()
print("Doing feature selection using Variance & K Best")
print("Original feature count", X_train_color_hist_lbp_hog.shape[1])
variance_selector = VarianceThreshold(threshold = 0)
# Define the number of features to select
k = int(X_train_color_hist_lbp_hog.shape[1] * pct)
# Fit the variance_selector to the training data
variance_selector.fit(X_train_color_hist_lbp_hog)
# Transform the training data
X_train_color_hist_lbp_hog_selected = variance_selector.transform(X_train_color_hist_lbp_hog)
del X_train_color_hist_lbp_hog
# Create a SelectKBest object with the f_classif scoring function
kbest_selector = SelectKBest(f_classif, k=min(k, X_train_color_hist_lbp_hog_selected.shape[1]))
# Fit the kbest_selector to the transformed training data
kbest_selector.fit(X_train_color_hist_lbp_hog_selected, y_train)
# Transform the training, test, and validation sets
X_train_color_hist_lbp_hog_selected = kbest_selector.transform(X_train_color_hist_lbp_hog_selected)
X_test_color_hist_lbp_hog_selected = kbest_selector.transform(variance_selector.transform(X_test_color_hist_lbp_hog))
X_val_color_hist_lbp_hog_selected = kbest_selector.transform(variance_selector.transform(X_val_color_hist_lbp_hog))
del X_test_color_hist_lbp_hog
del X_val_color_hist_lbp_hog
# Check the shape of the selected features
print("X_train selected shape:", X_train_color_hist_lbp_hog_selected.shape)
print("X_test selected shape:", X_test_color_hist_lbp_hog_selected.shape)
print("X_val selected shape:", X_val_color_hist_lbp_hog_selected.shape)
print(f'\033[1m\033[4mColor histogram/lbp/hog/contour/edge features with test/val dataset with \
{num_images_per_class} samples\033[0m')
train_and_evaluate(np.array(X_train_color_hist_lbp_hog_selected).reshape(len(X_train_color_hist_lbp_hog_selected)
, -1),
np.array(y_train),
np.array(X_test_color_hist_lbp_hog_selected).reshape(len(X_test_color_hist_lbp_hog_selected)
, -1),
np.array(y_test),
np.array(X_val_color_hist_lbp_hog_selected).reshape(len(X_val_color_hist_lbp_hog_selected)
, -1),
np.array(y_val),gridsrch,optslist)
else:
# Check the shape of the selected features
print("X_train selected shape:", X_train_color_hist_lbp_hog.shape)
print("X_test selected shape:", X_test_color_hist_lbp_hog.shape)
print("X_val selected shape:", X_val_color_hist_lbp_hog.shape)
print(f'\033[1m\033[4mColor histogram/lbp/hog/contour/edge features with test/val dataset with \
{num_images_per_class} samples\033[0m')
train_and_evaluate(np.array(X_train_color_hist_lbp_hog).reshape(len(X_train_color_hist_lbp_hog)
, -1),
np.array(y_train),
np.array(X_test_color_hist_lbp_hog).reshape(len(X_test_color_hist_lbp_hog)
, -1),
np.array(y_test),
np.array(X_val_color_hist_lbp_hog).reshape(len(X_val_color_hist_lbp_hog)
, -1),
np.array(y_val),gridsrch,optslist)
print()
print(f"Cumulative Time taken so far (seconds): {time.time() - st}")
print()
print("All Done!!")
except Exception as e:
print("Exception in mlfeaturizationandtraining ",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
# Report the ten source lines that allocated the most memory during this run
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print(" [ Top 10 ]")
for stat in top_stats[:10]:
print(stat)
Here we reduce the dimensions using PCA. We compute the principal components through Singular Value Decomposition (SVD) of the image features generated by the various techniques. We primarily focus on the combined (Combo) feature set, since that is the main hypothesis of this entire experiment. However, we also have the option to use embedding-based model features.
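The link between SVD and PCA can be checked numerically with a short sketch. It assumes a small random matrix as a stand-in for the real feature matrix, and the 95% variance cut-off is an illustrative choice rather than the value used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))   # hypothetical feature matrix: 50 samples, 6 features
Xc = X - X.mean(axis=0)        # center each column
n = Xc.shape[0]

# SVD of the centered data: Xc = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix recovered from the singular values.
eigvals_from_svd = S**2 / (n - 1)

# Direct eigendecomposition of the covariance matrix, sorted in decreasing order.
C = np.cov(Xc, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
print(np.allclose(eigvals, eigvals_from_svd))  # True

# Keep the smallest number of components explaining at least 95% of the variance.
explained_ratio = eigvals_from_svd / eigvals_from_svd.sum()
k = int(np.where(np.cumsum(explained_ratio) >= 0.95)[0][0] + 1)

# Principal component scores: projection of the data onto the first k axes.
scores = Xc @ Vt[:k].T
print(scores.shape)
```

The scores computed this way equal `U[:, :k] * S[:k]`, so the projection can be read directly off the SVD without forming the covariance matrix.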
def PCAbasedanalysis(num_images_per_class, parent_folder, k, whichftr,
alg, screeapp = True, gridsrch = False, Strat = True, en_y = False,
objective_trial= False, denoise = False, no_of_trials = 10):
"""
Singular Value Decomposition (SVD) is a computational method often employed to calculate principal components for a dataset.
Using SVD to perform PCA is efficient and numerically robust³.
Mathematically, given a real-valued data matrix `X` of size `n x p`, where `n` is the number of samples and `p` is the number of variables,
we can assume that it is centered, i.e. column means have been subtracted and are now equal to zero.
Then the `p x p` covariance matrix `C` is given by `C = X^T X / (n - 1)`.
It is a symmetric matrix and so it can be diagonalized: `C = VLV^T`, where `V` is a matrix of eigenvectors (each column is an eigenvector)
and `L` is a diagonal matrix with eigenvalues λi in the decreasing order on the diagonal.
The eigenvectors are called principal axes or principal directions of the data. Projections of the data on the principal
axes are called principal components, also known as PC scores; these can be seen as new, transformed, variables.
The j-th principal component is given by j-th column of `XV`. The coordinates of the i-th data point in the new PC
space are given by the i-th row of `XV`.
If we now perform singular value decomposition of `X`, we obtain a decomposition `X = UΣV^T`. The right singular vectors in matrix `V`
are the eigenvectors of the covariance matrix, and the singular values σi in matrix Σ are related to the eigenvalues of the
covariance matrix by λi = σi^2 / (n - 1)¹.
Source: Conversation with Bing,
(1) How Are Principal Component Analysis and Singular Value ... - Intoli. https://intoli.com/blog/pca-and-svd/.
(2) Relationship between SVD and PCA.
How to use SVD to perform PCA?. https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca.
(3) What is the intuitive relationship between SVD and PCA?.
https://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca.
(4) Singular Value Decomposition (SVD) vs Principal Component Analysis (PCA ....
https://askanydifference.com/difference-between-singular-value-decomposition-svd-and-principal-component-analysis-pca-with-table/.
(5) Machine Learning — Singular Value Decomposition (SVD ... - Medium.
https://jonathan-hui.medium.com/machine-learning-singular-value-decomposition-svd-principal-component-analysis-pca-1d45e885e491.
"""
class Logger(object):
def __init__(self, filename="Default.log"):
self.terminal = sys.stdout
self.log = open(filename, "a")
def write(self, message):
self.terminal.write(message)
self.log.write(message)
def flush(self):
pass
def custom_scorer(estimator, X, y, threshold = 0.95):
try:
train_acc = estimator.score(X, y)
print('Y shape', np.array(y).shape)
# Calculate the validation accuracy using cross-validation
val_acc = cross_val_score(estimator, X, y, cv=5).mean()
# Combine the training and validation accuracy in some way
# For example, you could take the average or the minimum of the two
score = (train_acc + val_acc) / 2
# If the training accuracy is greater than 0.95, return a low score
if train_acc > threshold or score > threshold: #, threshold = 0.95
return 0
except Exception as e:
print(str(e))
return score
def check_overfitting(train_accuracy, val_accuracy):
# Define a threshold for when the difference between the training and validation accuracy is too high
threshold = 0.05
# Calculate the difference between the training and validation accuracy
diff = train_accuracy - val_accuracy
# If the difference is greater than the threshold, the model may be overfitting
if diff > threshold:
print(f'The model may be overfitting because the difference between the training accuracy ({train_accuracy:.2f}) and validation accuracy ({val_accuracy:.2f}) is {diff:.2f}, which is greater than the threshold of {threshold}.')
# If validation accuracy exceeds training accuracy by more than the threshold, the model may be underfitting
elif diff < -threshold:
print(f'The model may be underfitting because the difference between the training accuracy ({train_accuracy:.2f}) and validation accuracy ({val_accuracy:.2f}) is {diff:.2f}, which is less than the negative threshold of {-threshold}.')
else:
print(f'The model seems to be fitting well because the difference between the training accuracy ({train_accuracy:.2f}) and validation accuracy ({val_accuracy:.2f}) is {diff:.2f}, which is within an acceptable range.')
def objective(trial):
"""
This function selects model optimizers and the best hyper parameters based on Trials
"""
#classifier_name = trial.suggest_categorical('classifier', ['SGD', 'SVC', 'RandomForest', 'GBM', 'AdaBoost', 'ExtraTrees', 'KNN'])
## Simplified modeling options for the trials
classifier_name = trial.suggest_categorical('classifier', ['SGD', 'SVC', 'RandomForest'])
if classifier_name == 'SGD':
sgd_loss = trial.suggest_categorical('sgd_loss', [ 'log_loss', 'modified_huber']) #, 'hinge','squared_hinge'
sgd_penalty = trial.suggest_categorical('sgd_penalty', ['l2', 'l1', 'elasticnet'])
sgd_alpha = trial.suggest_float('sgd_alpha', 1e-5, 1e2, log=True)
sgd_max_iter = trial.suggest_int('sgd_max_iter', 10000, 70000)
sgd_tol = trial.suggest_float('sgd_tol', 1e-5, 1e-1)
classifier_obj = SGDClassifier(loss=sgd_loss, penalty=sgd_penalty, alpha=sgd_alpha,
max_iter=sgd_max_iter, tol=sgd_tol)
"""sgd_learning_rate = trial.suggest_categorical('sgd_learning_rate', ['optimal','invscaling','adaptive'])
classifier_obj = SGDClassifier(loss=sgd_loss, penalty=sgd_penalty, alpha=sgd_alpha,
max_iter=sgd_max_iter, tol=sgd_tol, learning_rate=sgd_learning_rate)
sgd_eta0 = trial.suggest_loguniform('sgd_eta0', 1e-5, 1e-1)
sgd_power_t = trial.suggest_float('sgd_power_t', 0.1, 1.0)
classifier_obj = SGDClassifier(loss=sgd_loss, penalty=sgd_penalty, alpha=sgd_alpha,
l1_ratio=sgd_l1_ratio, max_iter=sgd_max_iter, tol=sgd_tol,
learning_rate=sgd_learning_rate
, eta0=sgd_eta0, power_t=sgd_power_t)"""
elif classifier_name == 'SVC':
svc_c = trial.suggest_float('svc_c', 1e-5, 1e10, log=True)
svc_kernel = trial.suggest_categorical('svc_kernel', ['linear', 'poly', 'rbf', 'sigmoid'])
svc_degree = trial.suggest_int('svc_degree', 1, 3)
svc_gamma = trial.suggest_categorical('svc_gamma', ['scale', 'auto'])
svc_coef0 = trial.suggest_float('svc_coef0', 0.0, 1.0)
svc_shrinking = trial.suggest_categorical('svc_shrinking', [True, False])
svc_max_iter = trial.suggest_int('svc_max_iter', 2000, 100000)
classifier_obj = SVC(C=svc_c, kernel=svc_kernel, degree=svc_degree,
gamma=svc_gamma, coef0=svc_coef0, shrinking=svc_shrinking,
max_iter = svc_max_iter, probability=True)
elif classifier_name == 'RandomForest':
rf_n_estimators = trial.suggest_int('rf_n_estimators', 10, 4000)
rf_max_depth = trial.suggest_int('rf_max_depth', 2, 32)
rf_min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 10)
rf_min_samples_leaf = trial.suggest_int('rf_min_samples_leaf', 1, 10)
rf_max_features = trial.suggest_categorical('rf_max_features', ['sqrt', 'log2'])
classifier_obj = RandomForestClassifier(n_estimators=rf_n_estimators, max_depth=rf_max_depth,
min_samples_split=rf_min_samples_split,
min_samples_leaf=rf_min_samples_leaf,
max_features=rf_max_features)
"""elif classifier_name == 'GBM':
gbm_n_estimators = trial.suggest_int('gbm_n_estimators', 10, 5000)
gbm_learning_rate = trial.suggest_loguniform('gbm_learning_rate', 1e-5, 1)
gbm_max_depth = trial.suggest_int('gbm_max_depth', 2, 32)
gbm_min_samples_split = trial.suggest_int('gbm_min_samples_split', 2, 10)
gbm_min_samples_leaf = trial.suggest_int('gbm_min_samples_leaf', 1, 10)
gbm_subsample = trial.suggest_float('gbm_subsample', 0.5, 1.0)
classifier_obj = GradientBoostingClassifier(n_estimators=gbm_n_estimators,
learning_rate=gbm_learning_rate,
max_depth=gbm_max_depth,
min_samples_split=gbm_min_samples_split,
min_samples_leaf=gbm_min_samples_leaf,
subsample=gbm_subsample,
validation_fraction=0.2,
n_iter_no_change=5,
tol=0.01)
elif classifier_name == 'AdaBoost':
ada_n_estimators = trial.suggest_int('ada_n_estimators', 10, 6000)
ada_learning_rate = trial.suggest_loguniform('ada_learning_rate', 1e-5, 1)
ada_algorithm = trial.suggest_categorical('ada_algorithm', ['SAMME', 'SAMME.R'])
ada_random_state = trial.suggest_int('ada_random_state', 0, 100)
classifier_obj = AdaBoostClassifier(n_estimators=ada_n_estimators, learning_rate=ada_learning_rate,
algorithm=ada_algorithm, random_state=ada_random_state)
elif classifier_name == 'ExtraTrees':
et_n_estimators = trial.suggest_int('et_n_estimators', 10, 10000)
et_max_depth = trial.suggest_int('et_max_depth', 2, 32)
et_min_samples_split = trial.suggest_int('et_min_samples_split', 2, 10)
et_min_samples_leaf = trial.suggest_int('et_min_samples_leaf', 1, 10)
classifier_obj = ExtraTreesClassifier(n_estimators=et_n_estimators, max_depth=et_max_depth,
min_samples_split=et_min_samples_split, min_samples_leaf=et_min_samples_leaf)
else:
knn_n_neighbors = trial.suggest_int('knn_n_neighbors', 1, X_train_pca.shape[0] // 2)
knn_weights = trial.suggest_categorical('knn_weights', ['uniform', 'distance'])
knn_p = trial.suggest_int('knn_p', 1, 2)
classifier_obj = KNeighborsClassifier(n_neighbors=knn_n_neighbors, weights=knn_weights,
p=knn_p)"""
# Evaluate model using cross-validation
cv_scores = cross_val_score(classifier_obj,
X_train_pca,
y_train_encoded,
cv = 7,
n_jobs=-1,
verbose = 15) # n_jobs=-1 to use all available cores for faster computation
cv_score = np.mean(cv_scores)
if cv_score > 0.95:
print(f"CV score is greater than the threshold {cv_score}")
print()
return 0.0
# Calculate AUC ROC for each fold of the cross-validation
auc_roc_scores = []
cv = StratifiedKFold(n_splits=7)
for train_index, test_index in cv.split(X_train_pca, y_train_encoded):
X_train_cv, X_test_cv = X_train_pca[train_index], X_train_pca[test_index]
y_train_cv, y_test_cv = y_train_encoded[train_index], y_train_encoded[test_index]
classifier_obj.fit(X_train_cv, y_train_cv)
y_pred_proba = classifier_obj.predict_proba(X_test_cv)[:, 1]
auc_roc_scores.append(roc_auc_score(y_test_cv, y_pred_proba))
"""if score_val > best_score:
best_score = score_val
best_params = classifier_obj.get_params()
return score_val"""
# Calculate the average AUC ROC score
auc_roc_score = np.mean(auc_roc_scores)
print(f"The AUC ROC score is {auc_roc_score}")
print()
# Return a zero score if the AUC ROC exceeds the 0.90 threshold, so that trials scoring above it are not selected
if auc_roc_score > 0.90:
print(f"The AUC ROC score is beyond the threshold {auc_roc_score}")
print()
return 0.0
# Use a weighted average of the cross-validation score and the AUC ROC score as the final score
final_score = 0.5 * cv_score + 0.5 * auc_roc_score
print("Final Score from the objective function",final_score)
print()
return final_score
try:
tracemalloc.start() # start trace malloc
## Logging to a file based log
sys.stdout = Logger("D:\\logs\\logging_verbose10.txt")
# Create a directory to store the feature files
feature_dir = 'features' + str(num_images_per_class)
# Create the full path to the feature directory
feature_path = os.path.join(parent_folder, feature_dir)
start = time.time()
print(f"Loading features from {feature_path}")
print()
# Load the data
if whichftr == 'all' or whichftr == 'five':
X_train = np.load(os.path.join(feature_path, 'X_train_pca.npy')).astype(np.float32)
y_train = np.load(os.path.join(feature_path,'Y_train.npy'))
X_test = np.load(os.path.join(feature_path, 'X_test_pca.npy')).astype(np.float32)
y_test = np.load(os.path.join(feature_path,'Y_test.npy'))
X_val = np.load(os.path.join(feature_path, 'X_val_pca.npy')).astype(np.float32)
y_val = np.load(os.path.join(feature_path,'Y_val.npy'))
if whichftr == 'all' or whichftr == 'two':
X_train = np.load(os.path.join(feature_path, 'X_train_cnn_features_resnet.npy')).astype(np.float32)
y_train = np.load(os.path.join(feature_path,'Y_train.npy'))
X_test = np.load(os.path.join(feature_path, 'X_test_cnn_features_resnet.npy')).astype(np.float32)
y_test = np.load(os.path.join(feature_path,'Y_test.npy'))
X_val = np.load(os.path.join(feature_path, 'X_val_cnn_features_resnet.npy')).astype(np.float32)
y_val = np.load(os.path.join(feature_path,'Y_val.npy'))
if whichftr == 'all' or whichftr == 'three':
X_train = np.load(os.path.join(feature_path, 'X_train_cnn.npy')).astype(np.float32)
y_train = np.load(os.path.join(feature_path,'Y_train.npy'))
X_test = np.load(os.path.join(feature_path, 'X_test_cnn.npy')).astype(np.float32)
y_test = np.load(os.path.join(feature_path,'Y_test.npy'))
X_val = np.load(os.path.join(feature_path, 'X_val_cnn.npy')).astype(np.float32)
y_val = np.load(os.path.join(feature_path,'Y_val.npy'))
if whichftr == 'all' or whichftr == 'four':
X_train = np.load(os.path.join(feature_path, 'X_train_color_hist_lbp_hog.npy')).astype(np.float32)
y_train = np.load(os.path.join(feature_path,'Y_train.npy'))
X_test = np.load(os.path.join(feature_path, 'X_test_color_hist_lbp_hog.npy')).astype(np.float32)
y_test = np.load(os.path.join(feature_path,'Y_test.npy'))
X_val = np.load(os.path.join(feature_path, 'X_val_color_hist_lbp_hog.npy')).astype(np.float32)
y_val = np.load(os.path.join(feature_path,'Y_val.npy'))
if whichftr == 'all' or whichftr == 'one':
X_train = np.load(os.path.join(feature_path, 'X_train_cnn_features224_vgg.npy')).astype(np.float32)
y_train = np.load(os.path.join(feature_path,'Y_train.npy'))
X_test = np.load(os.path.join(feature_path, 'X_test_cnn_features224_vgg.npy')).astype(np.float32)
y_test = np.load(os.path.join(feature_path,'Y_test.npy'))
X_val = np.load(os.path.join(feature_path, 'X_val_cnn_features224_vgg.npy')).astype(np.float32)
y_val = np.load(os.path.join(feature_path,'Y_val.npy'))
print()
print("Feature set sizes:")
print(f"Training: {np.array(X_train).shape}")
print(f"Test: {np.array(X_test).shape}")
print(f"Validation: {np.array(X_val).shape}")
print()
print("Features loaded. Cumulative Time taken in seconds", time.time() - start)
print()
# Compute the mean image
mean_image = np.mean(X_train, axis=0)
# Subtract the mean image from the training data
X_train_centered = X_train - mean_image
print("Co-variance matrix calculation. Cumulative Time taken in seconds", time.time() - start)
print()
# Compute the SVD of the centered training data; the principal directions are the rows of Vt,
# so the covariance matrix never has to be formed explicitly
U, S, Vt = svd(X_train_centered, full_matrices=False)
print("SVD processing . Cumulative Time taken in seconds", time.time() - start)
print()
if screeapp: ## If instead of using the hardcoded No of PCs we want to select PCs based on Explained Variance in the dataset
eigenvalues = S ** 2 / (X_train.shape[0] - 1)
# Plot the scree plot
plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.show()
# Determine the ideal number of PCs
cumulative_variance = np.cumsum(eigenvalues) / np.sum(eigenvalues)
ideal_num_pcs = np.argmax(cumulative_variance >= 0.95) + 1
print()
print(f'Ideal number of PCs: {ideal_num_pcs}')
print()
k = ideal_num_pcs
# Select the top k principal components
V = Vt.T
top_k_components = V[:, :k]
print(f"Top {k} components. Cumulative Time taken in seconds", time.time() - start)
print()
# Project the input data on the top k principal components
X_train_pca = X_train_centered.dot(top_k_components)
X_test_pca = (X_test - mean_image).dot(top_k_components)
X_val_pca = (X_val - mean_image).dot(top_k_components)
# Apply a median filter to the data to further de-noise the data post PCA
if denoise:
X_train_pca_denoised = median_filter(X_train_pca, size=3)
X_test_pca_denoised = median_filter(X_test_pca, size=3)
X_val_pca_denoised = median_filter(X_val_pca, size=3)
## In both cases, we need to encode Y
if en_y or objective_trial:
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
# Reuse the encoder fitted on the training labels; do not refit on test/validation
y_test_encoded = le.transform(y_test)
y_val_encoded = le.transform(y_val)
print("Calling fit transform to encode the labels. Cumulative Time taken in seconds", time.time() - start)
print()
scorer_uni = {
'accuracy': make_scorer(accuracy_score),
'roc_auc': make_scorer(roc_auc_score),
'precision': make_scorer(precision_score),
'recall': make_scorer(recall_score),
'f1': make_scorer(f1_score),
'average_precision': make_scorer(average_precision_score)
}
## Now that we have the data to use for training, we will use 2 methods for hyperparameter search and model selection
## First we will use the Optuna package, which helps find the best params from a range
## Next we will also manually select a list of params to tune & use regular GridSearchCV
if objective_trial:
print("Now running Objective trials using OPTUNA.")
print()
# Create study and optimize objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials = no_of_trials)
# Print best hyperparameters
print("The best hyperparams found:", study.best_params)
# Get the best hyperparameters from the study
best_params = study.best_params
print("Best Classifier:",best_params['classifier'])
print()
print("Cumulative Time taken in seconds", time.time() - start)
print()
# Get the best classifier name from the study
best_classifier_name = best_params['classifier']
# Remove the classifier name from the best_params dictionary
del best_params['classifier']
print("Best hyper-parameters:", best_params)
print()
# Define a dictionary to map the trial parameter names to the classifier parameter names
param_map = {
'svc_c': 'C',
'svc_kernel': 'kernel',
'svc_degree': 'degree',
'svc_gamma': 'gamma',
'svc_max_iter': 'max_iter',
'svc_shrinking': 'shrinking',
'svc_coef0': 'coef0',
'probability': 'probability',
'rf_n_estimators': 'n_estimators',
'rf_max_depth': 'max_depth',
'rf_min_samples_split': 'min_samples_split',
'rf_min_samples_leaf': 'min_samples_leaf',
'rf_max_features': 'max_features',
'gbm_n_estimators': 'n_estimators',
'gbm_learning_rate': 'learning_rate',
'gbm_max_depth': 'max_depth',
'gbm_min_samples_split': 'min_samples_split',
'gbm_min_samples_leaf': 'min_samples_leaf',
'gbm_subsample': 'subsample',
'ada_n_estimators': 'n_estimators',
'ada_learning_rate': 'learning_rate',
'ada_algorithm': 'algorithm',
'ada_random_state': 'random_state',
'et_n_estimators': 'n_estimators',
'et_max_depth': 'max_depth',
'et_min_samples_split': 'min_samples_split',
'et_min_samples_leaf': 'min_samples_leaf',
'knn_n_neighbors': 'n_neighbors',
'knn_weights':'weights','knn_p':'p',
'sgd_loss': 'loss',
'sgd_penalty': 'penalty',
'sgd_alpha': 'alpha',
'sgd_max_iter': 'max_iter',
'sgd_tol': 'tol',
'sgd_learning_rate': 'learning_rate',
'sgd_eta0': 'eta0',
'sgd_power_t': 'power_t'
}
# Use the param_map to rename the keys of the best_params dictionary
#best_params = {param_map[key]: value for key, value in best_params.items()}
best_params = {param_map.get(key, key): value for key, value in best_params.items()}
print("The params:",best_classifier_name, best_params)
print()
# Use the best hyperparameters to create your model -- simplify the list of optimizers
if best_classifier_name == "SGD":
#best_params['Probability'] = [True]
clf_best = SGDClassifier(**best_params)
elif best_classifier_name == "SVC":
# probability=True is required for the predict_proba calls further down
clf_best = SVC(probability=True, **best_params)
elif best_classifier_name == "RandomForest":
clf_best = RandomForestClassifier(**best_params)
"""elif best_classifier_name == "GBM":
clf_best = GradientBoostingClassifier(**best_params)
elif best_classifier_name == "AdaBoost":
clf_best = AdaBoostClassifier(**best_params)
elif best_classifier_name == "ExtraTrees":
clf_best = ExtraTreesClassifier(**best_params)
else:
clf_best = KNeighborsClassifier(**best_params)"""
# Fit the model to the training data
if denoise:
clf_best.fit(X_train_pca_denoised, y_train) #y_train_encoded
else:
clf_best.fit(X_train_pca, y_train) #y_train_encoded
print("Model fit with best model and hyper parameters done!")
print()
print("Now doing predictions")
print()
# Evaluate the model on the test data
if denoise:
test_accuracy = clf_best.score(X_test_pca_denoised, y_test)
print(f'Test accuracy from objective function approach _denoised_ : {test_accuracy:.3f}')
print()
val_accuracy = clf_best.score(X_val_pca_denoised, y_val)
print(f'Validation accuracy from objective function approach _denoised_: {val_accuracy:.3f}')
print()
y_train_pred = clf_best.predict(X_train_pca_denoised)
y_train_prob = clf_best.predict_proba(X_train_pca_denoised)[:, 1]
y_pred = clf_best.predict(X_test_pca_denoised)
y_prob = clf_best.predict_proba(X_test_pca_denoised)[:, 1]
y_val_pred = clf_best.predict(X_val_pca_denoised)
y_val_prob = clf_best.predict_proba(X_val_pca_denoised)[:, 1]
else:
test_accuracy = clf_best.score(X_test_pca, y_test)
print(f'Test accuracy from objective function approach: {test_accuracy:.3f}')
print()
val_accuracy = clf_best.score(X_val_pca, y_val)
print(f'Validation accuracy from objective function approach: {val_accuracy:.3f}')
print()
y_train_pred = clf_best.predict(X_train_pca)
y_train_prob = clf_best.predict_proba(X_train_pca)[:, 1]
y_pred = clf_best.predict(X_test_pca)
y_prob = clf_best.predict_proba(X_test_pca)[:, 1]
y_val_pred = clf_best.predict(X_val_pca)
y_val_prob = clf_best.predict_proba(X_val_pca)[:, 1]
#return
else: ## Not using the objective function approach
print("Now running SVC/SVM or SGD fit. Cumulative Time taken in seconds so far", time.time() - start)
print()
# Either do gridsearch or manually select Hyper Parameters in this section
if gridsrch:
if alg == 'SVM':
svm = SVC()
#scorer_svm = make_scorer(custom_scorer, greater_is_better = True, threshold = 0.92)
param_grid = {
'C': [0.1, 0.01, 1, 10, 100, 1000, 10000, 100000, 1000000],
'kernel': ['rbf', 'sigmoid'], #'linear', 'poly',
'degree': [1, 2, 3],
'gamma': ['scale', 'auto', 0.001, 0.0001, 0.01, 1, 10, 100, 1000, 10000], #+ list(np.logspace(-5, 3, num=5)),
'coef0': np.linspace(-1, 1, num=11),
'shrinking': [True], #, False
'probability': [True],
'tol': np.logspace(-6, -3, num=4),
'class_weight': ['balanced']
}
if Strat:
cv = StratifiedKFold(n_splits = 5)
else:
cv = 3
try:
#scoring = scorer_svm,
if en_y:
clf = GridSearchCV(svm, param_grid, scoring = scorer_uni,
refit='accuracy', cv = cv, n_jobs=-1, verbose = 15)
clf.fit(X_train_pca, y_train_encoded)
print("Automatically selected best Grid Params",clf.best_params_)
else:
clf = GridSearchCV(svm, param_grid, cv = cv, n_jobs=-1, verbose = 15)
clf.fit(X_train_pca, y_train) #y_train_encoded
except Exception as e:
print("From grid search", str(e))
return
elif alg == 'SGD':
#scorer_sgd = make_scorer(custom_scorer, greater_is_better=True, threshold = 0.92)
sgd = SGDClassifier(loss='log_loss', max_iter=300)
param_grid = {
'loss': ['hinge', 'log_loss', 'perceptron', 'modified_huber'],
'alpha': [0.01, 1, 0.1, 0.001, 0.00001, 0.000001], #, 10, 100
'penalty': ['l1', 'elasticnet','l2'], #, 'l1', 'elasticnet'
'l1_ratio': np.linspace(0, 1, num=5).tolist() + [0.15], #np.linspace(0, 1, num=5),
'max_iter': [10000, 25000, 50000, 100000], #1000, 10000
'tol': np.logspace(-6, -3, num=3),
'learning_rate': ['optimal','invscaling','adaptive'], #, 'invscaling', 'adaptive'
'eta0': np.logspace(-5, -1, num=3),
'power_t': np.linspace(0.1, 1.0, num=5)
}
#param_grid_iterator = tqdm(list(ParameterGrid(param_grid)))
try:
if en_y:
clf = GridSearchCV(sgd , param_grid, scoring = scorer_uni,
refit='accuracy', cv = 3, n_jobs=-1, verbose = 13)
clf.fit(X_train_pca, y_train_encoded)
print("Automatically selected best Grid Params",clf.best_params_)
else:
clf = GridSearchCV(sgd , param_grid, cv = 3, n_jobs=-1, verbose = 13)
clf.fit(X_train_pca, y_train)
except Exception as e:
print("From grid search SGD", str(e))
return
print("Grid search Cumulative Time taken in seconds", time.time() - start)
print()
print("Now selecting the chosen params manually based on metrics and retraining the model on those")
print()
# Access the cv_results_ attribute
cv_results = clf.cv_results_
df_cv_results = pd.DataFrame(cv_results)
#print("CV Results")
#print(df_cv_results)
print()
#print(df_cv_results.columns)
#print()
df_cv_results.columns = df_cv_results.columns.str.strip()
valid_params = []
for i in range(len(df_cv_results)):
params = df_cv_results.loc[i, 'params']
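# With multi-metric scoring (en_y = True) the column is 'mean_test_accuracy' rather than 'mean_test_score'
score_col = 'mean_test_accuracy' if 'mean_test_accuracy' in df_cv_results.columns else 'mean_test_score'
mean_test_score = df_cv_results.loc[i, score_col]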
if 0.85 <= mean_test_score <= 0.90:
valid_params.append({'params': params, 'mean_test_score': mean_test_score})
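# Name the columns explicitly so the sort below also works when no combination qualified
df_valid_params = pd.DataFrame(valid_params, columns=['params', 'mean_test_score'])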
# Sort the dataframe by the 'mean_test_score' column in descending order
df_valid_params = df_valid_params.sort_values(by='mean_test_score', ascending=False)
# Display only the top 5 rows
print("Top 5 Valid Params")
print(df_valid_params.head(5))
print()
print()
# Choose the best combination of parameters from the valid parameter combinations
if valid_params:
valid_params = sorted(valid_params, key=lambda x: x['mean_test_score'], reverse=True)
best_params = valid_params[0]['params']
print(f'Manually Selected Best Grid Search parameters: {best_params}')
print()
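# Create a new model (SVM or SGD) with the best parameters
if alg == 'SVM':
clf_best = SVC(**best_params)
elif alg == 'SGD':
# best_params already contains 'loss'; override it once with log_loss so predict_proba is available
clf_best = SGDClassifier(**{**best_params, 'loss': 'log_loss'})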
# Fit the model to the training data
clf_best.fit(X_train_pca, y_train)
# Evaluate the model on the test data
test_accuracy = clf_best.score(X_test_pca, y_test)
print(f'Test accuracy: {test_accuracy:.3f}')
print()
y_train_pred = clf_best.predict(X_train_pca)
y_train_prob = clf_best.predict_proba(X_train_pca)[:, 1]
y_pred = clf_best.predict(X_test_pca)
y_prob = clf_best.predict_proba(X_test_pca)[:, 1]
y_val_pred = clf_best.predict(X_val_pca)
y_val_prob = clf_best.predict_proba(X_val_pca)[:, 1]
else:
print('No parameter combination meets the specified criteria')
print()
return
else:
if alg == 'SVM':
# 'C': 100, 'class_weight': 'balanced', 'coef0': -1.0, 'gamma': 'scale',
# 'kernel': 'rbf', 'probability': True, 'shrinking': True, 'tol': 1e-06
clf_best = SVC(kernel='rbf', class_weight='balanced', probability=True)
elif alg == 'SGD':
clf_best = SGDClassifier(loss='log_loss', max_iter=100000, alpha= 1/100)
clf_best.fit(X_train_pca, y_train)
# Compute the training accuracy, recall, precision, and AUC-ROC
y_train_pred = clf_best.predict(X_train_pca)
y_train_prob = clf_best.predict_proba(X_train_pca)[:, 1]
y_pred = clf_best.predict(X_test_pca)
y_prob = clf_best.predict_proba(X_test_pca)[:, 1]
y_val_pred = clf_best.predict(X_val_pca)
y_val_prob = clf_best.predict_proba(X_val_pca)[:, 1]
print()
print("Singular SVC/SGD fit Cumulative Time taken in seconds", time.time() - start)
print()
print()
print("Now, plotting the output of training and saving models")
print()
pos_label = 'PNEUMONIA'
# Binarize the labels
y_train_bin = label_binarize(y_train, classes=['NORMAL', 'PNEUMONIA'])
y_train_pred_bin = label_binarize(y_train_pred, classes=['NORMAL', 'PNEUMONIA'])
y_test_bin = label_binarize(y_test, classes=['NORMAL', 'PNEUMONIA'])
y_pred_bin = label_binarize(y_pred, classes=['NORMAL', 'PNEUMONIA'])
y_val_bin = label_binarize(y_val, classes=['NORMAL', 'PNEUMONIA'])
y_val_pred_bin = label_binarize(y_val_pred, classes=['NORMAL', 'PNEUMONIA'])
# Compute the training accuracy, recall, precision, and AUC-ROC
train_accuracy = accuracy_score(y_train_bin, y_train_pred_bin)
train_recall = recall_score(y_train_bin, y_train_pred_bin)
train_precision = precision_score(y_train_bin, y_train_pred_bin)
train_auc_roc = roc_auc_score(y_train_bin, y_train_prob)
print(f'Training accuracy: {train_accuracy:.2f}')
print(f'Training recall: {train_recall:.2f}')
print(f'Training precision: {train_precision:.2f}')
print(f'Training AUC-ROC: {train_auc_roc:.2f}')
# Compute the test accuracy, recall, precision, and AUC-ROC
test_accuracy = accuracy_score(y_test_bin, y_pred_bin)
test_recall = recall_score(y_test_bin, y_pred_bin)
test_precision = precision_score(y_test_bin, y_pred_bin)
test_auc_roc = roc_auc_score(y_test_bin, y_prob)
print(f'Test accuracy: {test_accuracy:.2f}')
print(f'Test recall: {test_recall:.2f}')
print(f'Test precision: {test_precision:.2f}')
print(f'Test AUC-ROC: {test_auc_roc:.2f}')
# Compute the validation accuracy, recall, precision and AUC-ROC
val_accuracy=accuracy_score(y_val_bin,y_val_pred_bin)
val_recall=recall_score(y_val_bin,y_val_pred_bin)
val_precision=precision_score(y_val_bin,y_val_pred_bin)
val_auc_roc=roc_auc_score(y_val_bin,y_val_prob)
print(f'Validation accuracy: {val_accuracy:.2f}')
print(f'Validation recall: {val_recall:.2f}')
print(f'Validation precision: {val_precision:.2f}')
print(f'Validation AUC-ROC: {val_auc_roc:.2f}')
print("Accuracy calculation. Cumulative Time taken in seconds", time.time() - start)
print()
print()
print("Visualizing the results")
print()
class_names= ['NORMAL', 'PNEUMONIA']
class_index = list(clf_best.classes_).index('PNEUMONIA')
if denoise:
print()
print("ROC AUC for Test _denoised")
plot_roc_curve(np.array(y_test), np.array(clf_best.predict_proba(X_test_pca_denoised)[:, class_index]), class_names)
print()
print("ROC AUC for Validation _denoised")
plot_roc_curve(np.array(y_val), np.array(clf_best.predict_proba(X_val_pca_denoised)[:, class_index]), class_names)
else:
print()
print("ROC AUC for Test")
plot_roc_curve(np.array(y_test), np.array(clf_best.predict_proba(X_test_pca)[:, class_index]), class_names)
print()
print("ROC AUC for Validation")
plot_roc_curve(np.array(y_val), np.array(clf_best.predict_proba(X_val_pca)[:, class_index]), class_names)
try:
if 'PNEUMONIA' in np.unique(y_val_pred): #np.unique(y_val) == 7 and
print()
print("Printing how this model fares for class PNEUMONIA for test")
evaluate_class(y_pred, y_test, 'PNEUMONIA', class_names)
print()
print("Printing how this model fares for class PNEUMONIA for validation")
evaluate_class(y_val_pred, y_val, 'PNEUMONIA', class_names)
except Exception as e:
print()
print("From the PCA analysis function class evaluation",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
print()
print("Confusion Matrix for Test")
plot_confusion_matrix(np.array(y_test), np.array(y_pred), class_names)
print()
print("Confusion Matrix for Validation")
plot_confusion_matrix(np.array(y_val), np.array(y_val_pred), class_names)
print()
print("Memory Usage:")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage is {current / 10**6}MB; Peak was {peak / 10**6}MB")
tracemalloc.stop()
# Call this function with your calculated metrics
check_overfitting(train_accuracy, test_accuracy)
print()
print("All completed! Cumulative Time taken in seconds", time.time() - start)
print()
except Exception as e:
print("From the PCA analysis function ",str(e))
tb = traceback.extract_tb(e.__traceback__)
filename, line, func, text = tb[-1]
print(f'An exception occurred in file {filename}, line {line}, in {func}')
print(f'Code: {text}')
print(f'Exception: {str(e)}')
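The PCA step in the training cell above (centering, SVD, choosing the number of components from cumulative explained variance, then projecting) can be sketched in isolation as follows. This is a minimal illustration on synthetic data, not the notebook's exact code: the helper names `pca_fit` and `pca_transform`, the random data, and the use of `np.linalg.svd` are assumptions made for the example.

```python
import numpy as np

def pca_fit(X_train, variance_threshold=0.95):
    """Fit PCA via SVD; return the training mean and the top-k component matrix."""
    mean = X_train.mean(axis=0)
    centered = X_train - mean
    # Economy-size SVD: the rows of Vt are the principal directions
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = S ** 2 / (X_train.shape[0] - 1)
    cumulative = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # Smallest k whose cumulative explained variance reaches the threshold
    k = int(np.argmax(cumulative >= variance_threshold)) + 1
    return mean, Vt[:k].T

def pca_transform(X, mean, components):
    """Center with the training mean and project onto the fitted components."""
    return (X - mean) @ components

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 samples, 50 features, variance concentrated in 3 directions
latent = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
X = latent + 0.01 * rng.normal(size=(200, 50))
X_train, X_test = X[:150], X[150:]

mean, components = pca_fit(X_train)
X_train_pca = pca_transform(X_train, mean, components)
X_test_pca = pca_transform(X_test, mean, components)
print(X_train_pca.shape, X_test_pca.shape)
```

The test split is centered with the training mean and projected onto the training components, which mirrors how the cell treats `X_test` and `X_val`.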
In this section, we first extract features and store them to save compute, then run multiple experiments to find the right fit on this chest X-ray dataset for binary X-ray image classification.
Here we generate features from the images and save them. This is a fairly compute-intensive process and hence should not be repeated every time we want to run an ML experiment.
These features are saved as .npy files for future use.
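The save-once, reuse-later idea described above can be sketched as a small cache helper. This is only an illustration of the pattern: `load_or_extract`, `fake_extractor`, and the file name `X_train_demo.npy` are hypothetical and do not appear in the notebook, which does its own saving inside `mlfeaturizationandtraining`.

```python
import os
import tempfile
import numpy as np

def load_or_extract(path, extract_fn):
    """Return cached features from `path` if present, else compute and save them."""
    if os.path.exists(path):
        return np.load(path)
    features = extract_fn()
    np.save(path, features)
    return features

calls = []

def fake_extractor():
    # Stand-in for an expensive feature extractor (hypothetical)
    calls.append(1)
    return np.arange(12, dtype=np.float32).reshape(3, 4)

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "X_train_demo.npy")

first = load_or_extract(cache_path, fake_extractor)   # computes and saves
second = load_or_extract(cache_path, fake_extractor)  # loads from disk
print(len(calls), np.array_equal(first, second))
```

The extractor runs only on the first call; later experiments read the saved array directly from disk.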
howmanyimages() ## To prove we are using originals for Test and Validation
## Running feature extraction for 500 sample images per class
# The approach is to get above 60% accuracy with at least one of the 6 models using a minimal amount of data,
# so that this method can be used to train models and run inference on small devices with low resources
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(500, main_folder_path, parent_folder, gridsrch = False, load = False, tne = False,
color_space_clhce = 'BGR')
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
100%|██████████| 2/2 [00:00<00:00, 133.50it/s]
100%|██████████| 2/2 [00:00<00:00, 665.66it/s]
100%|██████████| 2/2 [00:00<00:00, 1995.86it/s]
Sizes of Train, Test & Validation arrays 1000 624 16
Will save the .npy files in ./features500
Extracting features for Training
C:\Users\saketadmin\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connectionpool.py:1056: InsecureRequestWarning: Unverified HTTPS request is being made to host 'storage.googleapis.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings warnings.warn(
==> Extracting CNN features with default image size without resizing
100%|██████████| 1000/1000 [09:26<00:00, 1.77it/s]
Without Resizing VGG16 feature extraction completed (1000, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|██████████| 1000/1000 [03:33<00:00, 4.69it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|██████████| 1000/1000 [12:45<00:00, 1.31it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|██████████| 1000/1000 [17:35<00:00, 1.06s/it]
Cumulative Time taken for extracing all complex features (seconds): 1055.272331237793
Printing the shapes of the feature arrays before calling hstack
CHIST (1000, 512)
LBP (1000, 1800000)
HOG (1000, 2217780)
CONTOUR (1000,)
EDGE (1000,)
Cumulative Time taken so far for stacking all features (seconds): 1073.4372310638428
Size of the CLHCE feature vector (1000, 4018294)
===>Saving in ./features500
Extracting features for Testing
==> Extracting CNN features with default image size without resizing
100%|██████████| 624/624 [21:30<00:00, 2.07s/it]
Without Resizing VGG16 feature extraction completed (624, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|██████████| 624/624 [06:23<00:00, 1.63it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|██████████| 624/624 [19:41<00:00, 1.89s/it]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|██████████| 624/624 [11:09<00:00, 1.07s/it]
Cumulative Time taken for extracing all complex features (seconds): 669.2219967842102
Printing the shapes of the feature arrays before calling hstack
CHIST (624, 512)
LBP (624, 1800000)
HOG (624, 2217780)
CONTOUR (624,)
EDGE (624,)
Cumulative Time taken so far for stacking all features (seconds): 679.268238067627
Size of the CLHCE feature vector (624, 4018294)
===>Saving in ./features500
Extracting features for Validation
==> Extracting CNN features with default image size without resizing
100%|██████████| 16/16 [01:27<00:00, 5.45s/it]
Without Resizing VGG16 feature extraction completed (16, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|██████████| 16/16 [00:15<00:00, 1.01it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|██████████| 16/16 [00:36<00:00, 2.26s/it]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|██████████| 16/16 [00:17<00:00, 1.08s/it]
Cumulative Time taken for extracing all complex features (seconds): 17.219820737838745
Printing the shapes of the feature arrays before calling hstack
CHIST (16, 512)
LBP (16, 1800000)
HOG (16, 2217780)
CONTOUR (16,)
EDGE (16,)
Cumulative Time taken so far for stacking all features (seconds): 17.485583782196045
Size of the CLHCE feature vector (16, 4018294)
===>Saving in ./features500
Time taken for feature extraction 6659.989357709885 seconds
All Done!!
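The shapes printed above account for the width of the CLHCE vector: 512 + 1,800,000 + 2,217,780 + 1 + 1 = 4,018,294 columns. The sketch below shows one way such blocks could be stacked with NumPy. The helper name `stack_clhce` and the tiny array sizes are illustrative assumptions, not the notebook's actual code. Because the CONTOUR and EDGE arrays are 1-D, they are reshaped into single columns before calling `np.hstack`.

```python
import numpy as np

def stack_clhce(chist, lbp, hog, contour, edge):
    """Stack per-image feature blocks column-wise into one 2-D matrix.

    Hypothetical helper: 1-D blocks (one scalar per image) become
    single-column matrices so that every block has shape (n_images, k).
    """
    blocks = []
    for block in (chist, lbp, hog, contour, edge):
        block = np.asarray(block)
        if block.ndim == 1:
            block = block.reshape(-1, 1)
        blocks.append(block)
    return np.hstack(blocks)

# Small stand-in arrays (the real blocks are far wider).
n_images = 16
stacked = stack_clhce(
    np.zeros((n_images, 8)),   # colour histogram block
    np.zeros((n_images, 30)),  # LBP block
    np.zeros((n_images, 20)),  # HOG block
    np.zeros(n_images),        # contour feature, one value per image
    np.zeros(n_images),        # edge feature, one value per image
)
print(stacked.shape)  # (16, 60)

# Width of the real CLHCE vector, from the logged shapes.
expected_width = 512 + 1_800_000 + 2_217_780 + 1 + 1
print(expected_width)  # 4018294
```

At roughly four million columns per image, this stacked matrix is what the later dimensionality-reduction step has to shrink.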
howmanyimages()  # To prove we are using originals for Test and Validation
# Run feature extraction with 100 images per class (about 400 sample images across train and test)
# The approach is to get above 60% accuracy with at least one of the 6 models, using a minimal amount of data,
# so that this method can be used to train models and run inference on small devices with low resources
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(100, main_folder_path, parent_folder, gridsrch = False, load = False, tne = False,
                           color_space_clhce = 'BGR')
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 496.34it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1985.47it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<?, ?it/s]
Sizes of Train, Test & Validation arrays 200 200 16
Will save the .npy files in ./features100
Extracting features for Training
==> Extracting CNN features with default image size without resizing
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [03:24<00:00, 1.02s/it]
Without Resizing VGG16 feature extraction completed (200, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [01:02<00:00, 3.21it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [01:35<00:00, 2.09it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [03:18<00:00, 1.01it/s]
Cumulative Time taken for extracing all complex features (seconds): 198.23362827301025
Printing the shapes of the feature arrays before calling hstack
CHIST (200, 512)
LBP (200, 1800000)
HOG (200, 2217780)
CONTOUR (200,)
EDGE (200,)
Cumulative Time taken so far for stacking all features (seconds): 201.28517365455627
Size of the CLHCE feature vector (200, 4018294)
===>Saving in ./features100
Extracting features for Testing
==> Extracting CNN features with default image size without resizing
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [06:28<00:00, 1.94s/it]
Without Resizing VGG16 feature extraction completed (200, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [01:16<00:00, 2.62it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [02:05<00:00, 1.59it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [03:19<00:00, 1.00it/s]
Cumulative Time taken for extracing all complex features (seconds): 199.40874671936035
Printing the shapes of the feature arrays before calling hstack
CHIST (200, 512)
LBP (200, 1800000)
HOG (200, 2217780)
CONTOUR (200,)
EDGE (200,)
Cumulative Time taken so far for stacking all features (seconds): 202.62491369247437
Size of the CLHCE feature vector (200, 4018294)
===>Saving in ./features100
Extracting features for Validation
==> Extracting CNN features with default image size without resizing
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:36<00:00, 2.28s/it]
Without Resizing VGG16 feature extraction completed (16, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:06<00:00, 2.38it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:11<00:00, 1.43it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:16<00:00, 1.01s/it]
Cumulative Time taken for extracing all complex features (seconds): 16.157812356948853
Printing the shapes of the feature arrays before calling hstack
CHIST (16, 512)
LBP (16, 1800000)
HOG (16, 2217780)
CONTOUR (16,)
EDGE (16,)
Cumulative Time taken so far for stacking all features (seconds): 16.437570571899414
Size of the CLHCE feature vector (16, 4018294)
===>Saving in ./features100
Time taken for feature extraction 1536.6806309223175 seconds
All Done!!
howmanyimages()  # To prove we are using originals for Test and Validation
# Run feature extraction with 250 images per class (500 train, 484 test and 16 validation images)
# The approach is to get above 60% accuracy with at least one of the 6 models, using a minimal amount of data,
# so that this method can be used to train models and run inference on small devices with low resources
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(250, main_folder_path, parent_folder, gridsrch = False, load = False, tne = False,
                           color_space_clhce = 'BGR')
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
Will save the .npy files in ./features250
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 166.42it/s]
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 285.25it/s]
100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1993.49it/s]
Sizes of Train, Test & Validation arrays 500 484 16
Extracting features for Training
==> Extracting CNN features with default image size without resizing
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [05:00<00:00, 1.66it/s]
Without Resizing VGG16 feature extraction completed (500, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [02:00<00:00, 4.16it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [03:06<00:00, 2.68it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [26:54<00:00, 3.23s/it]
Cumulative Time taken for extracing all complex features (seconds): 1614.0860464572906
Printing the shapes of the feature arrays before calling hstack
CHIST (500, 512)
LBP (500, 1800000)
HOG (500, 2217780)
CONTOUR (500,)
EDGE (500,)
Cumulative Time taken so far for stacking all features (seconds): 1621.8457601070404
Size of the CLHCE feature vector (500, 4018294)
===>Saving in ./features250
Extracting features for Testing
==> Extracting CNN features with default image size without resizing
100%|████████████████████████████████████████████████████████████████████████████████| 484/484 [09:40<00:00, 1.20s/it]
Without Resizing VGG16 feature extraction completed (484, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|████████████████████████████████████████████████████████████████████████████████| 484/484 [02:31<00:00, 3.19it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|████████████████████████████████████████████████████████████████████████████████| 484/484 [05:41<00:00, 1.42it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|████████████████████████████████████████████████████████████████████████████████| 484/484 [25:47<00:00, 3.20s/it]
Cumulative Time taken for extracing all complex features (seconds): 1547.3641152381897
Printing the shapes of the feature arrays before calling hstack
CHIST (484, 512)
LBP (484, 1800000)
HOG (484, 2217780)
CONTOUR (484,)
EDGE (484,)
Cumulative Time taken so far for stacking all features (seconds): 1554.6848213672638
Size of the CLHCE feature vector (484, 4018294)
===>Saving in ./features250
Extracting features for Validation
==> Extracting CNN features with default image size without resizing
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:27<00:00, 1.72s/it]
Without Resizing VGG16 feature extraction completed (16, 230400)
==> Extracting CNN 224 x 224 features with resizing
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:06<00:00, 2.48it/s]
With Resizing VGG16 feature extraction completed
==> Extracting Resnet50 224 x 224 features with resize
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:14<00:00, 1.12it/s]
With Resizing RESNET50 feature extraction completed
==> Extracting Color Hist, LBP, Hog, Contour & Edge features stacked into one
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:49<00:00, 3.08s/it]
Cumulative Time taken for extracing all complex features (seconds): 49.29555583000183
Printing the shapes of the feature arrays before calling hstack
CHIST (16, 512)
LBP (16, 1800000)
HOG (16, 2217780)
CONTOUR (16,)
EDGE (16,)
Cumulative Time taken so far for stacking all features (seconds): 49.54937195777893
Size of the CLHCE feature vector (16, 4018294)
===>Saving in ./features250
Time taken for feature extraction 5105.050082921982 seconds
All Done!!
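The logged array sizes (200/200/16 for 100 images per class, 500/484/16 for 250) are consistent with taking at most N images per class from each split, capped by how many files that class actually has. The sketch below checks that arithmetic against the logged file counts. The capping rule is inferred from the logs, and `sampled_size` is a hypothetical helper, not a function from the notebook.

```python
def sampled_size(n_per_class, class_counts):
    """Number of images drawn from a split when taking at most
    n_per_class files from each class (assumed sampling rule)."""
    return sum(min(n_per_class, count) for count in class_counts.values())

# File counts per class, as printed in the logs above.
train = {"NORMAL": 1344, "PNEUMONIA": 3874}
test = {"NORMAL": 234, "PNEUMONIA": 390}
validation = {"NORMAL": 8, "PNEUMONIA": 8}

for n in (100, 250, 500):
    sizes = [sampled_size(n, split) for split in (train, test, validation)]
    print(n, sizes)
# 100 [200, 200, 16]
# 250 [500, 484, 16]
# 500 [1000, 624, 16]
```

The test split only has 234 NORMAL images, which is why requesting 250 per class yields 484 test images rather than 500.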
In this section, we run experiments with the extracted features. We reduce the dimensionality of the feature set in two ways: variance thresholding followed by K-best feature selection in one approach, and PCA computed via SVD in the other.
Note: For both approaches, we need to ensure that the model generalizes and does not leak features (for example, from the test or validation data into training), and that accuracy and recall are high enough.
We will also run a grid search over the hyperparameters.
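For the PCA + SVD approach, a minimal sketch follows. It assumes the number of components is chosen with a cumulative explained-variance cutoff (`variance_to_keep` is a hypothetical parameter; the notebook's scree-based rule may differ), and it fits the projection on the training split only, so that nothing from the test or validation data leaks into the transform. Random data stands in for the real feature matrices.

```python
import numpy as np

def fit_pca(X_train, variance_to_keep=0.95):
    """Fit PCA on the training data only, using a thin SVD.

    Returns the training mean, the top-k principal directions (rows),
    and k, the smallest count whose cumulative explained variance
    reaches variance_to_keep (an assumed selection rule).
    """
    mean = X_train.mean(axis=0)
    centered = X_train - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    k = min(k, len(singular_values))
    return mean, vt[:k], k

def apply_pca(X, mean, components):
    """Project any split using the mean and components learned on train."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 200))  # stand-in for the training features
X_test = rng.normal(size=(20, 200))   # stand-in for the test features

mean, components, k = fit_pca(X_train)
Z_train = apply_pca(X_train, mean, components)
Z_test = apply_pca(X_test, mean, components)
print(k, Z_train.shape, Z_test.shape)
```

Reusing the training mean and components for the test and validation splits keeps the evaluation honest: those splits never influence which directions are kept.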
In this section, we call Optuna to fine-tune the best models.
main_folder_path = './chest_xray'
parent_folder = './'
# If objective_trial is set to False, gridsrch must be set to True
PCAbasedanalysis(100, parent_folder, 150, 'four', 'SGD',
                 screeapp = True, gridsrch = False, Strat = False,
                 en_y = False, objective_trial = True, denoise = False)
Loading features from ./features100
Feature set sizes: Training: (200, 4018294) Test: (200, 4018294) Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 6.249511957168579
Co-variance matrix calculation. Cumulative Time taken in seconds 7.088639497756958
SVD processing . Cumulative Time taken in seconds 64.79028534889221
Ideal number of PCs: 42
Top 42 components. Cumulative Time taken in seconds 65.3443911075592
[I 2023-09-02 19:05:11,255] A new study created in memory with name: no-name-6a37ab7f-80b9-4297-abe2-099cb998142a
Calling fit transform to encode the labels. Cumulative Time taken in seconds 67.37024593353271
Now running Objective trials using OPTUNA.
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_14012\2481846468.py:94: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.4s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.4s remaining: 3.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.4s remaining: 1.9s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.4s remaining: 1.0s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.4s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.4s finished
[I 2023-09-02 19:05:12,845] Trial 0 finished with value: 0.0 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l1', 'sgd_alpha': 0.057880085332950165, 'sgd_max_iter': 37230, 'sgd_tol': 0.08099283412416179}. Best is trial 0 with value: 0.0.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
The AUC ROC score is 0.9193877551020407
The AUC ROC score is beyond the threshold 0.9193877551020407
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.5s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 19:05:13,593] Trial 1 finished with value: 0.8629603565564157 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 13.623577619370879, 'sgd_max_iter': 13760, 'sgd_tol': 0.009828098764695016}. Best is trial 1 with value: 0.8629603565564157.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
The AUC ROC score is 0.8258503401360544
Final Score from the objective function 0.8629603565564157
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 3.5s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 3.5s remaining: 9.0s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 3.6s remaining: 4.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 3.6s remaining: 2.7s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 3.6s remaining: 1.4s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 3.7s finished
[I 2023-09-02 19:07:22,940] Trial 2 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 2286, 'rf_max_depth': 20, 'rf_min_samples_split': 2, 'rf_min_samples_leaf': 7, 'rf_max_features': 'log2'}. Best is trial 1 with value: 0.8629603565564157.
The AUC ROC score is 0.9544217687074831
The AUC ROC score is beyond the threshold 0.9544217687074831
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 5.1s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 5.1s remaining: 12.9s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 5.1s remaining: 6.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 5.1s remaining: 3.8s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 5.1s remaining: 2.0s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 5.2s finished
[I 2023-09-02 19:10:51,303] Trial 3 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 3674, 'rf_max_depth': 21, 'rf_min_samples_split': 2, 'rf_min_samples_leaf': 9, 'rf_max_features': 'log2'}. Best is trial 1 with value: 0.8629603565564157.
The AUC ROC score is 0.950971817298348
The AUC ROC score is beyond the threshold 0.950971817298348
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 0.2s finished
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.2s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.2s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.2s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.2s finished
[I 2023-09-02 19:10:51,716] Trial 4 finished with value: 0.8375439831104855 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 0.05094631640817233, 'sgd_max_iter': 19910, 'sgd_tol': 0.04085194557946915}. Best is trial 1 with value: 0.8629603565564157.
The AUC ROC score is 0.8459183673469387
Final Score from the objective function 0.8375439831104855
The best hyperparams found: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 13.623577619370879, 'sgd_max_iter': 13760, 'sgd_tol': 0.009828098764695016}
Best Classifier: SGD
Cumulative Time taken in seconds 407.8315863609314
Best hyper-parameters: {'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 13.623577619370879, 'sgd_max_iter': 13760, 'sgd_tol': 0.009828098764695016}
The params: SGD {'loss': 'modified_huber', 'penalty': 'l2', 'alpha': 13.623577619370879, 'max_iter': 13760, 'tol': 0.009828098764695016}
Model fit with best model and hyper parameters done!
Now doing predictions
Test accuracy from objective function approach: 0.840
Validation accuracy from objective function approach: 0.812
Now, plotting the output of training and saving models
Training accuracy: 0.91
Training recall: 0.82
Training precision: 0.99
Training AUC-ROC: 0.90
Test accuracy: 0.84
Test recall: 0.84
Test precision: 0.84
Test AUC-ROC: 0.84
Validation accuracy: 0.81
Validation recall: 0.75
Validation precision: 0.86
Validation AUC-ROC: 0.81
Accuracy calculation. Cumulative Time taken in seconds 407.91955614089966
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for test
Printing how this model fairs for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 13136.434875MB; Peak was 22776.054041MB
The model may be overfitting because the difference between the training accuracy (0.91) and validation accuracy (0.84) is 0.07, which is greater than the threshold of 0.05.
Loading features from ./features500
Feature set sizes: Training: (1000, 4018294) Test: (624, 4018294) Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 24.695274829864502
Co-variance matrix calculation. Cumulative Time taken in seconds 28.713603496551514
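The overfitting message in the output above is a simple gap check between training accuracy and a held-out accuracy. Note that the 0.84 it reports matches the test accuracy printed earlier (the validation accuracy was 0.81). The sketch below reproduces that check, along with a current/peak memory report in the same style. It is an illustration only: `overfitting_gap` is a hypothetical helper, and using the standard-library `tracemalloc` module for the memory figures is an assumption about how they could be gathered.

```python
import tracemalloc

def overfitting_gap(train_acc, holdout_acc, threshold=0.05):
    """Return the accuracy gap and whether it exceeds the threshold."""
    gap = train_acc - holdout_acc
    return gap, gap > threshold

# Track allocations around a stand-in workload.
tracemalloc.start()
data = [float(i) for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
tracemalloc.stop()
print(f"Current memory usage is {current / 1e6:.6f}MB; Peak was {peak / 1e6:.6f}MB")

# Values taken from the run logged above.
gap, flagged = overfitting_gap(0.91, 0.84)
print(f"gap={gap:.2f}, flagged={flagged}")  # gap=0.07, flagged=True
```

With only 16 validation images, one misclassified image moves validation accuracy by more than six points, so a fixed 0.05 threshold is a coarse signal on that split.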
main_folder_path = './chest_xray'
parent_folder = './'
# If objective_trial is set to False, gridsrch must be set to True
PCAbasedanalysis(500, parent_folder, 150, 'four', 'SGD',
                 screeapp = True, gridsrch = False, Strat = False,
                 en_y = False, objective_trial = True, denoise = False)
Loading features from ./features500
Feature set sizes: Training: (1000, 4018294) Test: (624, 4018294) Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 24.502930879592896
Co-variance matrix calculation. Cumulative Time taken in seconds 28.565422773361206
SVD processing . Cumulative Time taken in seconds 611.829975605011
Ideal number of PCs: 239
Top 239 components. Cumulative Time taken in seconds 612.3362863063812
[I 2023-09-02 19:53:21,580] A new study created in memory with name: no-name-da998db4-1c43-4cdb-806f-da88d0415a35
Calling fit transform to encode the labels. Cumulative Time taken in seconds 619.786614894867
Now running Objective trials using OPTUNA.
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_12732\2476190303.py:113: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
svc_c = trial.suggest_loguniform('svc_c', 1e-5, 1e10)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.7s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.7s remaining: 4.3s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.7s remaining: 2.3s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.7s remaining: 1.2s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.7s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.7s finished
[I 2023-09-02 19:53:25,321] Trial 0 finished with value: 0.7048118666473159 and parameters: {'classifier': 'SVC', 'svc_c': 0.0004540478947842062, 'svc_kernel': 'poly', 'svc_degree': 2, 'svc_gamma': 'scale', 'svc_coef0': 0.23486163220565537, 'svc_shrinking': True, 'svc_max_iter': 12526}. Best is trial 0 with value: 0.7048118666473159.
The AUC ROC score is 0.882643671103302
Final Score from the objective function 0.7048118666473159
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 15.8s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 15.9s remaining: 40.0s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 16.0s remaining: 21.4s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 16.0s remaining: 12.0s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 16.1s remaining: 6.4s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 16.2s finished
[I 2023-09-02 19:57:02,741] Trial 1 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 2098, 'rf_max_depth': 14, 'rf_min_samples_split': 9, 'rf_min_samples_leaf': 5, 'rf_max_features': 'log2'}. Best is trial 0 with value: 0.7048118666473159.
The AUC ROC score is 0.9553848915086764
The AUC ROC score is beyond the threshold 0.9553848915086764
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 6.9s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 6.9s remaining: 17.5s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 7.0s remaining: 9.3s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 7.0s remaining: 5.2s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 7.0s remaining: 2.7s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 7.3s finished
[I 2023-09-02 19:58:31,876] Trial 2 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 840, 'rf_max_depth': 15, 'rf_min_samples_split': 8, 'rf_min_samples_leaf': 4, 'rf_max_features': 'log2'}. Best is trial 0 with value: 0.7048118666473159.
The AUC ROC score is 0.9553258518245623
The AUC ROC score is beyond the threshold 0.9553258518245623
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 out of 1 | elapsed: 0.2s finished
[Parallel(n_jobs=-1)]: Done 2 tasks | elapsed: 0.4s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.4s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.4s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.4s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.4s finished
[I 2023-09-02 19:58:33,809] Trial 3 finished with value: 0.49206772255274533 and parameters: {'classifier': 'SVC', 'svc_c': 7028770860.29511, 'svc_kernel': 'sigmoid', 'svc_degree': 1, 'svc_gamma': 'scale', 'svc_coef0': 0.3264365202755518, 'svc_shrinking': True, 'svc_max_iter': 59221}. Best is trial 0 with value: 0.7048118666473159.
The AUC ROC score is 0.49718577505722916
Final Score from the objective function 0.49206772255274533
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_12732\2476190303.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.03207850456237793s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.0s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.2s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 10.7s finished
[I 2023-09-02 19:58:45,124] Trial 4 finished with value: 0.8740346117106681 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l1', 'sgd_alpha': 0.10986880176841844, 'sgd_max_iter': 32765, 'sgd_tol': 0.022130505284560942}. Best is trial 4 with value: 0.8740346117106681.
The AUC ROC score is 0.8641292197630227
Final Score from the objective function 0.8740346117106681
The best hyperparams found: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l1', 'sgd_alpha': 0.10986880176841844, 'sgd_max_iter': 32765, 'sgd_tol': 0.022130505284560942}
Best Classifier: SGD
Cumulative Time taken in seconds 943.3341658115387
Best hyper-parameters: {'sgd_loss': 'log_loss', 'sgd_penalty': 'l1', 'sgd_alpha': 0.10986880176841844, 'sgd_max_iter': 32765, 'sgd_tol': 0.022130505284560942}
The params: SGD {'loss': 'log_loss', 'penalty': 'l1', 'alpha': 0.10986880176841844, 'max_iter': 32765, 'tol': 0.022130505284560942}
Model fit with best model and hyper parameters done!
Now doing predictions
Test accuracy from objective function approach: 0.824
Validation accuracy from objective function approach: 1.000
Now, plotting the output of training and saving models
Training accuracy: 0.91
Training recall: 0.83
Training precision: 0.99
Training AUC-ROC: 0.91
Test accuracy: 0.82
Test recall: 0.87
Test precision: 0.85
Test AUC-ROC: 0.81
Validation accuracy: 1.00
Validation recall: 1.00
Validation precision: 1.00
Validation AUC-ROC: 1.00
Accuracy calculation. Cumulative Time taken in seconds 943.4763724803925
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fares for class PNEUMONIA for test
Printing how this model fares for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 58532.390546MB; Peak was 106750.056057MB
The model may be overfitting because the difference between the training accuracy (0.91) and validation accuracy (0.82) is 0.09, which is greater than the threshold of 0.05.
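The overfitting message above can be reproduced with a simple gap check. The sketch below is only an illustration: the function name `overfitting_gap` and its signature are hypothetical, not taken from the notebook. Note that the 0.82 figure in the logged message matches the test accuracy reported above (validation accuracy in this run is 1.00), so the second argument is called the hold-out accuracy here.

```python
def overfitting_gap(train_acc: float, holdout_acc: float, threshold: float = 0.05):
    """Return (gap, flagged) for a training/hold-out accuracy pair.

    Hypothetical helper: flags a run when training accuracy exceeds
    hold-out accuracy by more than `threshold`.
    """
    gap = train_acc - holdout_acc
    return gap, gap > threshold


# Values taken from the run above: training 0.91, test 0.82, threshold 0.05.
gap, flagged = overfitting_gap(0.91, 0.82)
print(f"gap={gap:.2f}, flagged={flagged}")
```

Because the validation split holds only 16 images, a gap computed against the 624-image test split is likely the more stable signal.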
main_folder_path = './chest_xray'
parent_folder = './'
# If objective_trial is False, gridsrch must be set to True.
PCAbasedanalysis(500, parent_folder, 150, 'four', 'SGD',
                 screeapp=True, gridsrch=False, Strat=False,
                 en_y=False, objective_trial=True, denoise=False)
Loading features from ./features500
Feature set sizes: Training: (1000, 4018294) Test: (624, 4018294) Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 24.35588002204895
Co-variance matrix calculation. Cumulative Time taken in seconds 28.41743540763855
SVD processing. Cumulative Time taken in seconds 611.5237085819244
Ideal number of PCs: 239
Top 239 components. Cumulative Time taken in seconds 612.0262908935547
[I 2023-09-02 20:13:36,006] A new study created in memory with name: no-name-ae728f1a-ece4-467e-871d-c783d5460097
Calling fit transform to encode the labels. Cumulative Time taken in seconds 619.4463477134705
Now running Objective trials using OPTUNA.
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:113: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
svc_c = trial.suggest_loguniform('svc_c', 1e-5, 1e10)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.7s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.7s remaining: 4.4s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.7s remaining: 2.3s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.7s remaining: 1.3s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.7s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.7s finished
[I 2023-09-02 20:13:39,992] Trial 0 finished with value: 0.6308184227991807 and parameters: {'classifier': 'SVC', 'svc_c': 0.014503606056885889, 'svc_kernel': 'sigmoid', 'svc_degree': 2, 'svc_gamma': 'scale', 'svc_coef0': 0.381299856234101, 'svc_shrinking': False, 'svc_max_iter': 56889}. Best is trial 0 with value: 0.6308184227991807.
The AUC ROC score is 0.6806685181511606
Final Score from the objective function 0.6308184227991807
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.5s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 20:13:40,857] Trial 1 finished with value: 0.8690334997553308 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 3.083705451941911, 'sgd_max_iter': 36920, 'sgd_tol': 0.0423785538426143}. Best is trial 1 with value: 0.8690334997553308.
The AUC ROC score is 0.8980689693717865
Final Score from the objective function 0.8690334997553308
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 20:13:41,974] Trial 2 finished with value: 0.8595037591516465 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.021407638799743386, 'sgd_max_iter': 61857, 'sgd_tol': 0.08250585477394924}. Best is trial 1 with value: 0.8690334997553308.
The AUC ROC score is 0.8710596914822266
Final Score from the objective function 0.8595037591516465
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 20:13:42,863] Trial 3 finished with value: 0.8634715245630739 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l2', 'sgd_alpha': 1.6262185357112602, 'sgd_max_iter': 23736, 'sgd_tol': 0.022734531453030768}. Best is trial 1 with value: 0.8690334997553308.
The AUC ROC score is 0.867063492063492
Final Score from the objective function 0.8634715245630739
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:113: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
svc_c = trial.suggest_loguniform('svc_c', 1e-5, 1e10)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.8s remaining: 2.1s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.8s remaining: 1.1s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.8s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.8s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.8s finished
[I 2023-09-02 20:13:45,414] Trial 4 finished with value: 0.0 and parameters: {'classifier': 'SVC', 'svc_c': 29744.579030629415, 'svc_kernel': 'rbf', 'svc_degree': 3, 'svc_gamma': 'scale', 'svc_coef0': 0.8103055165713388, 'svc_shrinking': False, 'svc_max_iter': 61681}. Best is trial 1 with value: 0.8690334997553308.
The AUC ROC score is 0.9823471344498919
The AUC ROC score is beyond the threshold 0.9823471344498919
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.7s finished
[I 2023-09-02 20:13:46,636] Trial 5 finished with value: 0.8805388664543594 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l1', 'sgd_alpha': 0.015851879572541534, 'sgd_max_iter': 29799, 'sgd_tol': 0.06507719942708264}. Best is trial 5 with value: 0.8805388664543594.
The AUC ROC score is 0.8731276548177956
Final Score from the objective function 0.8805388664543594
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 20:13:47,699] Trial 6 finished with value: 0.8815730924005573 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.000128058945107182, 'sgd_max_iter': 39594, 'sgd_tol': 0.09963547596828681}. Best is trial 6 with value: 0.8815730924005573.
The AUC ROC score is 0.8851861167002013
Final Score from the objective function 0.8815730924005573
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 4.7s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 4.7s remaining: 12.0s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 4.8s remaining: 6.4s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 4.8s remaining: 3.6s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 4.8s remaining: 1.9s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 4.9s finished
[I 2023-09-02 20:14:48,398] Trial 7 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 594, 'rf_max_depth': 17, 'rf_min_samples_split': 3, 'rf_min_samples_leaf': 8, 'rf_max_features': 'log2'}. Best is trial 6 with value: 0.8815730924005573.
The AUC ROC score is 0.9515370784960153
The AUC ROC score is beyond the threshold 0.9515370784960153
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.0370635986328125s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.0s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 20:14:49,253] Trial 8 finished with value: 0.8720930868818194 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l2', 'sgd_alpha': 0.00400305467193727, 'sgd_max_iter': 54723, 'sgd_tol': 0.05441790512470175}. Best is trial 6 with value: 0.8815730924005573.
The AUC ROC score is 0.8841940532081379
Final Score from the objective function 0.8720930868818194
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_11656\1782647539.py:113: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
svc_c = trial.suggest_loguniform('svc_c', 1e-5, 1e10)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.1s
[Parallel(n_jobs=-1)]: Batch computation too fast (0.16293811798095703s.) Setting batch_size=2.
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.1s remaining: 0.3s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.1s remaining: 0.1s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.1s remaining: 0.0s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.1s finished
[I 2023-09-02 20:14:50,563] Trial 9 finished with value: 0.5976817601495246 and parameters: {'classifier': 'SVC', 'svc_c': 6915055.447256666, 'svc_kernel': 'sigmoid', 'svc_degree': 1, 'svc_gamma': 'auto', 'svc_coef0': 0.01922807062485421, 'svc_shrinking': False, 'svc_max_iter': 24096}. Best is trial 6 with value: 0.8815730924005573.
The AUC ROC score is 0.6133750861979388
Final Score from the objective function 0.5976817601495246
The best hyperparams found: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.000128058945107182, 'sgd_max_iter': 39594, 'sgd_tol': 0.09963547596828681}
Best Classifier: SGD
Cumulative Time taken in seconds 694.0033128261566
Best hyper-parameters: {'sgd_loss': 'log_loss', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.000128058945107182, 'sgd_max_iter': 39594, 'sgd_tol': 0.09963547596828681}
The params: SGD {'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': 0.000128058945107182, 'max_iter': 39594, 'tol': 0.09963547596828681}
Model fit with best model and hyper parameters done!
Now doing predictions
Test accuracy from objective function approach: 0.822
Validation accuracy from objective function approach: 0.875
Now, plotting the output of training and saving models
Training accuracy: 0.94
Training recall: 0.90
Training precision: 0.98
Training AUC-ROC: 0.94
Test accuracy: 0.82
Test recall: 0.89
Test precision: 0.84
Test AUC-ROC: 0.80
Validation accuracy: 0.88
Validation recall: 0.88
Validation precision: 0.88
Validation AUC-ROC: 0.88
Accuracy calculation. Cumulative Time taken in seconds 694.1310126781464
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fares for class PNEUMONIA for test
Printing how this model fares for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 58532.485635MB; Peak was 106750.056625MB
The model may be overfitting because the difference between the training accuracy (0.94) and validation accuracy (0.82) is 0.12, which is greater than the threshold of 0.05.
All completed! Cumulative Time taken in seconds 697.1376330852509
Loading features from ./features1000
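The FutureWarning repeated throughout these logs recommends replacing `trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)` with `trial.suggest_float('sgd_alpha', 1e-5, 1e2, log=True)`, and likewise for `svc_c`. Both forms sample the parameter uniformly in log space. The standard-library sketch below only illustrates what that means; the helper `sample_log_uniform` is hypothetical and is not part of Optuna or of this notebook.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Same bounds as the 'sgd_alpha' search range in the logs above.
draws = [sample_log_uniform(1e-5, 1e2, rng) for _ in range(1000)]

# Each decade should receive a roughly equal share of the draws:
# [1e-5, 1e-3) spans 2 of the 7 decades, so about 2/7 of the samples.
share_below_1e3 = sum(d < 1e-3 for d in draws) / len(draws)
print(f"share of draws below 1e-3: {share_below_1e3:.2f}")
```

Sampling this way gives small values such as 1e-4 the same chance of being explored as large values such as 10, which a plain uniform draw over the same range would not.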
main_folder_path = './chest_xray'
parent_folder = './'
# If objective_trial is False, gridsrch must be set to True.
# Combined feature set in use: color histogram + LBP + HOG + contours + edges
PCAbasedanalysis(1000, parent_folder, 150, 'four', 'SGD',
                 screeapp=True, gridsrch=False, Strat=False,
                 en_y=False, objective_trial=True)
Loading features from ./features1000
Feature set sizes: Training: (2000, 4018294) Test: (624, 4018294) Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 40.54073977470398
Co-variance matrix calculation. Cumulative Time taken in seconds 48.83466601371765
SVD processing. Cumulative Time taken in seconds 1270.1206395626068
Ideal number of PCs: 465
Top 465 components. Cumulative Time taken in seconds 1270.8352704048157
[I 2023-09-02 20:38:37,441] A new study created in memory with name: no-name-13a6b9df-31a3-4e2f-901f-0a10210ea6fc
Calling fit transform to encode the labels. Cumulative Time taken in seconds 1284.9974687099457
Now running Objective trials using OPTUNA.
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.4s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.4s remaining: 3.8s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.4s remaining: 2.0s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.4s remaining: 1.1s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.5s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.5s finished
[I 2023-09-02 20:38:39,402] Trial 0 finished with value: 0.0 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 2.2866594567639217e-05, 'sgd_max_iter': 48801, 'sgd_tol': 0.03442702325542625}. Best is trial 0 with value: 0.0.
The AUC ROC score is 0.9045356052398305
The AUC ROC score is beyond the threshold 0.9045356052398305
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:113: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
svc_c = trial.suggest_loguniform('svc_c', 1e-5, 1e10)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.7s remaining: 4.3s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.7s remaining: 2.3s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.7s remaining: 1.2s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.7s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.7s finished
[I 2023-09-02 20:38:49,185] Trial 1 finished with value: 0.0 and parameters: {'classifier': 'SVC', 'svc_c': 0.07496137194649308, 'svc_kernel': 'poly', 'svc_degree': 3, 'svc_gamma': 'scale', 'svc_coef0': 0.4459591141825875, 'svc_shrinking': False, 'svc_max_iter': 72290}. Best is trial 0 with value: 0.0.
The AUC ROC score is 0.912119324583621
The AUC ROC score is beyond the threshold 0.912119324583621
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 2.8min
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 2.8min remaining: 7.1min
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 2.8min remaining: 3.8min
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 2.9min remaining: 2.1min
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 2.9min remaining: 1.2min
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 2.9min finished
[I 2023-09-02 21:03:53,336] Trial 2 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 3861, 'rf_max_depth': 31, 'rf_min_samples_split': 10, 'rf_min_samples_leaf': 9, 'rf_max_features': 'sqrt'}. Best is trial 0 with value: 0.0.
The AUC ROC score is 0.9703985072349103
The AUC ROC score is beyond the threshold 0.9703985072349103
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.7s remaining: 4.3s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.7s remaining: 2.3s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.7s remaining: 1.2s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.7s remaining: 0.6s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.9s finished
[I 2023-09-02 21:03:57,055] Trial 3 finished with value: 0.0 and parameters: {'classifier': 'SGD', 'sgd_loss': 'log_loss', 'sgd_penalty': 'l1', 'sgd_alpha': 0.000249298646524565, 'sgd_max_iter': 68302, 'sgd_tol': 0.08797029851601627}. Best is trial 0 with value: 0.0.
The AUC ROC score is 0.9175296534451466
The AUC ROC score is beyond the threshold 0.9175296534451466
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.3min
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.6min remaining: 4.1min
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 2.3min remaining: 3.1min
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 2.3min remaining: 1.7min
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 2.3min remaining: 56.0s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 2.4min finished
C:\Users\saketadmin\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\linear_model\_stochastic_gradient.py:713: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
warnings.warn(
[I 2023-09-02 21:22:06,486] Trial 4 finished with value: 0.7779813897493659 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l1', 'sgd_alpha': 4.755011503147148, 'sgd_max_iter': 45061, 'sgd_tol': 0.0847779770009193}. Best is trial 4 with value: 0.7779813897493659.
The AUC ROC score is 0.7386029463494251
Final Score from the objective function 0.7779813897493659
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 35.1s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 35.1s remaining: 1.5min
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 35.3s remaining: 47.1s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 35.4s remaining: 26.5s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 35.6s remaining: 14.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 35.9s finished
[I 2023-09-02 21:27:59,602] Trial 5 finished with value: 0.0 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 1783, 'rf_max_depth': 19, 'rf_min_samples_split': 3, 'rf_min_samples_leaf': 9, 'rf_max_features': 'log2'}. Best is trial 4 with value: 0.7779813897493659.
The AUC ROC score is 0.9420146484068478
The AUC ROC score is beyond the threshold 0.9420146484068478
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 1.5s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 1.5s remaining: 3.8s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 1.5s remaining: 2.0s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 1.5s remaining: 1.1s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 1.5s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 1.5s finished
[I 2023-09-02 21:28:02,268] Trial 6 finished with value: 0.8975269232867454 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.010472536535417173, 'sgd_max_iter': 16586, 'sgd_tol': 0.09640817691989553}. Best is trial 6 with value: 0.8975269232867454.
The AUC ROC score is 0.8915450746436663
Final Score from the objective function 0.8975269232867454
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.7s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.7s remaining: 1.9s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.7s remaining: 1.0s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.7s remaining: 0.5s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.7s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.8s finished
[I 2023-09-02 21:28:04,480] Trial 7 finished with value: 0.0 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l1', 'sgd_alpha': 9.088258940547194e-05, 'sgd_max_iter': 13859, 'sgd_tol': 0.048834383487103346}. Best is trial 6 with value: 0.8975269232867454.
The AUC ROC score is 0.9215256574411504
The AUC ROC score is beyond the threshold 0.9215256574411504
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:95: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
sgd_alpha = trial.suggest_loguniform('sgd_alpha', 1e-5, 1e2)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 0.6s remaining: 1.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 0.6s remaining: 0.8s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 0.6s remaining: 0.4s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 0.6s remaining: 0.2s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 0.6s finished
[I 2023-09-02 21:28:05,632] Trial 8 finished with value: 0.0 and parameters: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'l2', 'sgd_alpha': 0.32068891778985226, 'sgd_max_iter': 43306, 'sgd_tol': 0.027546782282779778}. Best is trial 6 with value: 0.8975269232867454.
The AUC ROC score is 0.91103614695164
The AUC ROC score is beyond the threshold 0.91103614695164
C:\Users\saketadmin\AppData\Local\Temp\2\ipykernel_6460\1782647539.py:113: FutureWarning: suggest_loguniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use suggest_float(..., log=True) instead.
svc_c = trial.suggest_loguniform('svc_c', 1e-5, 1e10)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 61 concurrent workers.
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 3.4s
[Parallel(n_jobs=-1)]: Done 2 out of 7 | elapsed: 3.4s remaining: 8.6s
[Parallel(n_jobs=-1)]: Done 3 out of 7 | elapsed: 3.5s remaining: 4.7s
[Parallel(n_jobs=-1)]: Done 4 out of 7 | elapsed: 3.5s remaining: 2.6s
[Parallel(n_jobs=-1)]: Done 5 out of 7 | elapsed: 3.5s remaining: 1.3s
[Parallel(n_jobs=-1)]: Done 7 out of 7 | elapsed: 3.6s finished
[I 2023-09-02 21:28:28,372] Trial 9 finished with value: 0.49974937343358394 and parameters: {'classifier': 'SVC', 'svc_c': 0.000686917625713202, 'svc_kernel': 'rbf', 'svc_degree': 3, 'svc_gamma': 'auto', 'svc_coef0': 0.13028153623356875, 'svc_shrinking': True, 'svc_max_iter': 84442}. Best is trial 6 with value: 0.8975269232867454.
The AUC ROC score is 0.5
Final Score from the objective function 0.49974937343358394
The best hyperparams found: {'classifier': 'SGD', 'sgd_loss': 'modified_huber', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.010472536535417173, 'sgd_max_iter': 16586, 'sgd_tol': 0.09640817691989553}
Best Classifier: SGD
Cumulative Time taken in seconds 4275.931656360626
Best hyper-parameters: {'sgd_loss': 'modified_huber', 'sgd_penalty': 'elasticnet', 'sgd_alpha': 0.010472536535417173, 'sgd_max_iter': 16586, 'sgd_tol': 0.09640817691989553}
The params: SGD {'loss': 'modified_huber', 'penalty': 'elasticnet', 'alpha': 0.010472536535417173, 'max_iter': 16586, 'tol': 0.09640817691989553}
Model fit with best model and hyper parameters done!
Now doing predictions
Test accuracy from objective function approach: 0.816
Validation accuracy from objective function approach: 0.875
Now, plotting the output of training and saving models
Training accuracy: 0.94
Training recall: 0.90
Training precision: 0.98
Training AUC-ROC: 0.94
Test accuracy: 0.82
Test recall: 0.90
Test precision: 0.82
Test AUC-ROC: 0.79
Validation accuracy: 0.88
Validation recall: 0.88
Validation precision: 0.88
Validation AUC-ROC: 0.88
Accuracy calculation. Cumulative Time taken in seconds 4276.1216769218445
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for test
Printing how this model fairs for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 106767.730213MB; Peak was 203213.1559MB
The model may be overfitting because the difference between the training accuracy (0.94) and validation accuracy (0.82) is 0.13, which is greater than the threshold of 0.05.
All completed! Cumulative Time taken in seconds 4279.149178266525
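The overfitting warning in the output above compares the gap between training accuracy and a held-out accuracy against a threshold of 0.05. A minimal sketch of such a check follows; the helper name `overfit_gap` and its signature are hypothetical and not taken from the notebook. The logged gap of 0.13 was presumably computed on unrounded accuracies, since the rounded values shown (0.94 and 0.82) differ by 0.12, and the 0.82 figure matches the test accuracy printed earlier in the same run.

```python
def overfit_gap(train_acc: float, heldout_acc: float, threshold: float = 0.05):
    """Return (gap, flagged); flagged is True when the gap exceeds the threshold."""
    gap = train_acc - heldout_acc
    return gap, gap > threshold


# Rounded accuracies from the run above (training vs. the 0.82 held-out figure).
gap, flagged = overfit_gap(0.94, 0.82)
print(f"gap={gap:.2f}, possible overfitting={flagged}")
```

A check like this is only a heuristic: with a held-out set as small as 16 images, the gap itself is noisy.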
In this section we use GridSearchCV to find the best parameters for the SGD classifier on the PCA-reduced dataset and try to reduce the overfitting we observed earlier.
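As a rough illustration of this step, the sketch below runs scikit-learn's `GridSearchCV` over an `SGDClassifier` on PCA-reduced data. It is not the notebook's `PCAbasedanalysis` implementation: the synthetic data, the number of components, the `modified_huber` loss, and the small parameter grid are all assumptions chosen so the example runs in seconds. PCA is applied before the search to mirror the order of the logged steps (SVD first, then the grid search).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the extracted image features (assumption).
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=20, random_state=0
)

# Reduce to a handful of principal components before the search.
X_pca = PCA(n_components=15, random_state=0).fit_transform(X)

# Deliberately small, hypothetical grid: 3 * 2 * 2 = 12 candidates.
param_grid = {
    "alpha": [1e-3, 1e-2, 1e-1],
    "penalty": ["l2", "elasticnet"],
    "l1_ratio": [0.15, 0.5],  # only used when penalty is 'elasticnet'
}

search = GridSearchCV(
    SGDClassifier(loss="modified_huber", max_iter=2000, tol=1e-3, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=1,
)
search.fit(X_pca, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stronger regularization (larger `alpha`) is one of the levers a search like this can pull to narrow the gap between training and held-out accuracy.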
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(500, parent_folder, 150, 'four', 'SGD',
screeapp = False, gridsrch = True, Strat = False)
Loading features from ./features500
Feature set sizes:
Training: (1000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 220.11403226852417
Co-variance matrix calculation. Cumulative Time taken in seconds 223.65370345115662
SVD processing . Cumulative Time taken in seconds 623.9292032718658
Top 150 components. Cumulative Time taken in seconds 623.9292032718658
Now running SVC fit. Cumulative Time taken in seconds 631.8049941062927
Fitting 3 folds for each of 24300 candidates, totalling 72900 fits
Grid search Cumulative Time taken in seconds 940.8744316101074
Now selecting the chosen params based on metrics and retraining the model on those
Top 5 Valid Params
params mean_test_score
6392 {'alpha': 0.1, 'eta0': 0.001, 'l1_ratio': 0.25... 0.925992
6486 {'alpha': 0.1, 'eta0': 0.001, 'l1_ratio': 0.5,... 0.923996
7272 {'alpha': 0.1, 'eta0': 0.001, 'l1_ratio': 0.15... 0.921997
14400 {'alpha': 1e-05, 'eta0': 1e-05, 'l1_ratio': 1.... 0.921991
16785 {'alpha': 1e-05, 'eta0': 0.1, 'l1_ratio': 0.5,... 0.921991
Best parameters: {'alpha': 0.1, 'eta0': 0.001, 'l1_ratio': 0.25, 'learning_rate': 'adaptive', 'max_iter': 50000, 'penalty': 'elasticnet', 'power_t': 0.325, 'tol': 0.001}
Test accuracy: 0.853
Training accuracy: 0.96
Training recall: 0.94
Training precision: 0.99
Training AUC-ROC: 0.96
Test accuracy: 0.85
Test recall: 0.92
Test precision: 0.85
Test AUC-ROC: 0.83
Validation accuracy: 0.94
Validation recall: 1.00
Validation precision: 0.89
Validation AUC-ROC: 0.94
Accuracy calculation. Cumulative Time taken in seconds 943.5457201004028
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for test
Printing how this model fairs for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 58551.269692MB; Peak was 106750.045303MB
Loading features from ./features500
Feature set sizes:
Training: (1000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 22.882405996322632
Co-variance matrix calculation. Cumulative Time taken in seconds 26.58222484588623
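With only 16 validation images (8 per class), each misclassified image moves validation accuracy by 6.25 points, which helps explain why the validation metrics swing so much between runs. As a worked example, the validation figures in the grid-search run above (accuracy 0.94, recall 1.00, precision 0.89) are consistent with 8 true positives, 1 false positive, 7 true negatives, and 0 false negatives. These counts are inferred from the rounded metrics, not read from the notebook's confusion-matrix plot, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, recall, and precision for the positive (PNEUMONIA) class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }


# Inferred counts for the 16-image validation set (assumption, see lead-in).
m = binary_metrics(tp=8, fp=1, tn=7, fn=0)
print({name: round(value, 2) for name, value in m.items()})
```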
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(500, parent_folder, 150, 'four', 'SGD',
screeapp = False, gridsrch = True, Strat = False)
Loading features from ./features500
Feature set sizes:
Training: (1000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 23.56836247444153
Co-variance matrix calculation. Cumulative Time taken in seconds 27.29624032974243
SVD processing . Cumulative Time taken in seconds 448.53566431999207
Top 150 components. Cumulative Time taken in seconds 448.53566431999207
Now running SVC fit. Cumulative Time taken in seconds 456.64457392692566
Calling fit transform to encode the labels. Cumulative Time taken in seconds 456.64457392692566
Fitting 3 folds for each of 19440 candidates, totalling 58320 fits
Grid search Cumulative Time taken in seconds 868.5767576694489
Now selecting the chosen params based on metrics and retraining the model on those
Top 5 Valid Params
params mean_test_score
3655 {'alpha': 0.01, 'eta0': 0.1, 'l1_ratio': 0.5, ... 0.923996
11470 {'alpha': 1, 'eta0': 0.001, 'l1_ratio': 0.15, ... 0.922995
4340 {'alpha': 0.01, 'eta0': 0.1, 'l1_ratio': 0.15,... 0.922000
11466 {'alpha': 1, 'eta0': 0.001, 'l1_ratio': 0.15, ... 0.921002
6082 {'alpha': 0.1, 'eta0': 0.001, 'l1_ratio': 0.0,... 0.921002
Best parameters: {'alpha': 0.01, 'eta0': 0.1, 'l1_ratio': 0.5, 'learning_rate': 'adaptive', 'max_iter': 10000, 'penalty': 'l2', 'power_t': 0.325, 'tol': 3.1622776601683795e-05}
Test accuracy: 0.856
Training accuracy: 0.96
Training recall: 0.93
Training precision: 0.99
Training AUC-ROC: 0.96
Test accuracy: 0.86
Test recall: 0.92
Test precision: 0.86
Test AUC-ROC: 0.83
Validation accuracy: 1.00
Validation recall: 1.00
Validation precision: 1.00
Validation AUC-ROC: 1.00
Accuracy calculation. Cumulative Time taken in seconds 870.6085822582245
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for validation
Printing how this model fairs for class PNEUMONIA for test
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 58546.935763MB; Peak was 106750.045695MB
Loading features from ./features1000
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(1000, parent_folder, 150, 'four', 'SGD',
screeapp = False, gridsrch = True, Strat = False)
Loading features from ./features1000
Feature set sizes:
Training: (2000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 341.7417221069336
Co-variance matrix calculation. Cumulative Time taken in seconds 349.3867928981781
SVD processing . Cumulative Time taken in seconds 1443.7333126068115
Top 150 components. Cumulative Time taken in seconds 1443.7333126068115
Now running SVC fit. Cumulative Time taken in seconds 1454.7821698188782
Fitting 3 folds for each of 19440 candidates, totalling 58320 fits
Grid search Cumulative Time taken in seconds 2167.222937822342
Now selecting the chosen params based on metrics and retraining the model on those
Top 5 Valid Params
params mean_test_score
7264 {'alpha': 0.1, 'eta0': 0.1, 'l1_ratio': 0.0, '... 0.93
6149 {'alpha': 0.1, 'eta0': 0.001, 'l1_ratio': 0.25... 0.93
10270 {'alpha': 1, 'eta0': 0.001, 'l1_ratio': 0.25, ... 0.93
15918 {'alpha': 0.001, 'eta0': 0.1, 'l1_ratio': 0.25... 0.93
15391 {'alpha': 0.001, 'eta0': 0.001, 'l1_ratio': 0.... 0.93
Best parameters: {'alpha': 0.01, 'eta0': 1e-05, 'l1_ratio': 1.0, 'learning_rate': 'adaptive', 'max_iter': 10000, 'penalty': 'l2', 'power_t': 0.1, 'tol': 1e-06}
Test accuracy: 0.857
Training accuracy: 0.95
Training recall: 0.91
Training precision: 0.99
Training AUC-ROC: 0.95
Test accuracy: 0.86
Test recall: 0.92
Test precision: 0.86
Test AUC-ROC: 0.84
Validation accuracy: 1.00
Validation recall: 1.00
Validation precision: 1.00
Validation AUC-ROC: 1.00
Accuracy calculation. Cumulative Time taken in seconds 2169.275582551956
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for test
Printing how this model fairs for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 106778.16739MB; Peak was 203213.146007MB
In this section we use GridSearchCV to find the best parameters for the SVM classifier on the PCA-reduced dataset and try to reduce the overfitting we observed earlier.
I have observed that a simplified model with a simplified hyperparameter grid is more likely to generalize here.
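The fit counts in the logs follow directly from the grid size: GridSearchCV evaluates the Cartesian product of all parameter lists once per fold. The SGD runs above, for example, report 24300 candidates x 3 folds = 72900 fits. The sketch below counts candidates for two hypothetical SVM grids; the parameter values are illustrative only and are not the notebook's actual grids.

```python
from math import prod


def count_fits(param_grid: dict, n_folds: int) -> tuple:
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds


# Hypothetical grids for illustration.
full_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "gamma": ["scale", "auto"],
    "coef0": [-1.0, 0.0, 1.0],
    "degree": [1, 2, 3],
}
small_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"], "gamma": ["scale"]}

print(count_fits(full_grid, n_folds=3))
print(count_fits(small_grid, n_folds=3))
```

Every list added to the grid multiplies the number of fits, so trimming parameters that the chosen kernel ignores (such as `degree` for `rbf`) cuts search time without removing any distinct model.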
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(1000, parent_folder, 150, 'four', 'SVM',
screeapp = True, gridsrch = True, Strat = False)
Loading features from ./features1000
Feature set sizes:
Training: (2000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 349.8340091705322
Co-variance matrix calculation. Cumulative Time taken in seconds 357.46832275390625
SVD processing . Cumulative Time taken in seconds 1452.4162735939026
Ideal number of PCs: 465
Top 465 components. Cumulative Time taken in seconds 1452.9662125110626
Now running SVC fit. Cumulative Time taken in seconds 1466.7768449783325
Fitting 3 folds for each of 23760 candidates, totalling 71280 fits
Grid search Cumulative Time taken in seconds 10279.715799808502
Now selecting the chosen params based on metrics and retraining the model on those
Top 5 Valid Params
params mean_test_score
0 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.884994
83 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.884994
97 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.884994
96 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.884994
95 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.884994
Best parameters: {'C': 1, 'class_weight': 'balanced', 'coef0': -1.0, 'degree': 1, 'gamma': 'scale', 'kernel': 'rbf', 'probability': True, 'shrinking': True, 'tol': 1e-06}
Test accuracy: 0.793
Training accuracy: 0.92
Training recall: 0.88
Training precision: 0.95
Training AUC-ROC: 0.97
Test accuracy: 0.79
Test recall: 0.86
Test precision: 0.82
Test AUC-ROC: 0.85
Validation accuracy: 0.69
Validation recall: 1.00
Validation precision: 0.62
Validation AUC-ROC: 0.48
Accuracy calculation. Cumulative Time taken in seconds 10284.129793643951
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for test
Printing how this model fairs for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 106786.958054MB; Peak was 203213.146062MB
The model may be overfitting because the difference between the training accuracy (0.92) and validation accuracy (0.79) is 0.12, which is greater than the threshold of 0.05.
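The run above used `screeapp = True` and reported 465 as the ideal number of principal components. How the notebook arrives at that number is not visible in this output. One common approach, sketched here purely as an assumption, is to keep the smallest number of components whose cumulative explained variance reaches a target; the 0.95 target, the synthetic data, and the helper name `n_components_for_variance` are all hypothetical.

```python
import numpy as np


def n_components_for_variance(X: np.ndarray, target: float = 0.95) -> int:
    """Smallest number of principal components reaching the target variance share."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data; their squares are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(singular_values))


# Synthetic low-rank data: 5 latent directions plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

k = n_components_for_variance(X, target=0.95)
print(k)
```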
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(500, parent_folder, 150, 'four', 'SVM',
screeapp = False, gridsrch = True, Strat = False)
Loading features from ./features500
Feature set sizes:
Training: (1000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 23.28279447555542
Co-variance matrix calculation. Cumulative Time taken in seconds 27.05053424835205
SVD processing . Cumulative Time taken in seconds 450.7149889469147
Top 150 components. Cumulative Time taken in seconds 450.7149889469147
Now running SVC fit. Cumulative Time taken in seconds 458.69847202301025
Fitting 3 folds for each of 23760 candidates, totalling 71280 fits
Grid search Cumulative Time taken in seconds 838.5881102085114
Now selecting the chosen params based on metrics and retraining the model on those
Top 5 Valid Params
params mean_test_score
0 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.856983
83 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.856983
97 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.856983
96 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.856983
95 {'C': 1, 'class_weight': 'balanced', 'coef0': ... 0.856983
Best parameters: {'C': 1, 'class_weight': 'balanced', 'coef0': -1.0, 'degree': 1, 'gamma': 'scale', 'kernel': 'rbf', 'probability': True, 'shrinking': True, 'tol': 1e-06}
Test accuracy: 0.761
Training accuracy: 0.91
Training recall: 0.84
Training precision: 0.97
Training AUC-ROC: 0.96
Test accuracy: 0.76
Test recall: 0.83
Test precision: 0.80
Test AUC-ROC: 0.82
Validation accuracy: 0.56
Validation recall: 0.75
Validation precision: 0.55
Validation AUC-ROC: 0.48
Accuracy calculation. Cumulative Time taken in seconds 841.1943316459656
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for test
Printing how this model fairs for class PNEUMONIA for validation
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 58546.233125MB; Peak was 106750.045413MB
Loading features from ./features1000
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(500, parent_folder, 150, 'four', 'SVM', False)
Loading features from ./features500
Feature set sizes:
Training: (1000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 22.756098985671997
Co-variance matrix calculation. Cumulative Time taken in seconds 26.337512254714966
SVD processing . Cumulative Time taken in seconds 436.21798825263977
Top 150 components. Cumulative Time taken in seconds 436.21798825263977
Now running SVC fit. Cumulative Time taken in seconds 444.3569781780243
Singlular SVC fit Cumulative Time taken in seconds 444.7033770084381
Training accuracy: 0.91
Training recall: 0.84
Training precision: 0.97
Training AUC-ROC: 0.96
Test accuracy: 0.76
Test recall: 0.83
Test precision: 0.80
Test AUC-ROC: 0.82
Validation accuracy: 0.56
Validation recall: 0.75
Validation precision: 0.55
Validation AUC-ROC: 0.48
Accuracy calculation. Cumulative Time taken in seconds 444.7662434577942
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for validation
Printing how this model fairs for class PNEUMONIA for test
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 218072.929353MB; Peak was 266291.56718MB
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(1000, parent_folder, 300, 'four', True)
Loading features from ./features1000
Feature set sizes:
Training: (2000, 4018294)
Test: (624, 4018294)
Validation: (16, 4018294)
Features loaded. Cumulative Time taken in seconds 361.2734169960022
Co-variance matrix calculation. Cumulative Time taken in seconds 369.0726191997528
SVD processing . Cumulative Time taken in seconds 1492.7481708526611
Top 300 components. Cumulative Time taken in seconds 1492.7481708526611
Now running SVC fit. Cumulative Time taken in seconds 1504.5222973823547
Fitting 5 folds for each of 1056 candidates, totalling 5280 fits
Grid search Cumulative Time taken in seconds 1839.1669540405273
Best parameters found during grid search: {'C': 100, 'class_weight': 'balanced', 'coef0': -1.0, 'gamma': 'scale', 'kernel': 'rbf', 'probability': True, 'shrinking': True, 'tol': 1e-06}
Training accuracy: 1.00
Training recall: 1.00
Training precision: 1.00
Training AUC-ROC: 1.00
Test accuracy: 0.75
Test recall: 0.98
Test precision: 0.72
Test AUC-ROC: 0.87
Validation accuracy: 0.81
Validation recall: 1.00
Validation precision: 0.73
Validation AUC-ROC: 0.97
Accuracy calculation. Cumulative Time taken in seconds 1839.6263904571533
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for validation
Printing how this model fairs for class PNEUMONIA for test
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 106765.609442MB; Peak was 203213.144855MB
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(1000, parent_folder, 400, 'one', True)
Loading features from ./features1000
Feature set sizes:
Training: (2000, 100352)
Test: (624, 100352)
Validation: (16, 100352)
Features loaded. Cumulative Time taken in seconds 0.6029186248779297
Co-variance matrix calculation. Cumulative Time taken in seconds 0.8162574768066406
SVD processing . Cumulative Time taken in seconds 51.02188682556152
Top 400 components. Cumulative Time taken in seconds 51.02188682556152
Now running SVC fit. Cumulative Time taken in seconds 51.34607553482056
Fitting 5 folds for each of 1056 candidates, totalling 5280 fits
Grid search Cumulative Time taken in seconds 626.0390961170197
Best parameters found during grid search: {'C': 10, 'class_weight': 'balanced', 'coef0': -0.8, 'gamma': 'scale', 'kernel': 'sigmoid', 'probability': True, 'shrinking': True, 'tol': 1e-06}
Training accuracy: 1.00
Training recall: 1.00
Training precision: 1.00
Training AUC-ROC: 1.00
Test accuracy: 0.81
Test recall: 0.99
Test precision: 0.77
Test AUC-ROC: 0.96
Validation accuracy: 1.00
Validation recall: 1.00
Validation precision: 1.00
Validation AUC-ROC: 1.00
Accuracy calculation. Cumulative Time taken in seconds 626.2515969276428
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for validation
Printing how this model fairs for class PNEUMONIA for test
Confusion Matrix for Test
Confusion Matrix for Validation
Memory Usage: Current memory usage is 2691.79439MB; Peak was 5106.325096MB
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(1000, parent_folder, 400, 'one')
Loading features from ./features1000
Feature set sizes:
Training: (2000, 100352)
Test: (624, 100352)
Validation: (16, 100352)
Features loaded. Cumulative Time taken in seconds 0.5954890251159668
Co-variance matrix calculation. Cumulative Time taken in seconds 0.801537275314331
SVD 50.3572793006897
Top 400 components. Cumulative Time taken in seconds 50.3572793006897
Now running SVC fit. Cumulative Time taken in seconds 50.65144324302673
Accuracy calculation. Cumulative Time taken in seconds 52.11882495880127
Training accuracy: 1.00
Training recall: 1.00
Training precision: 1.00
Training AUC-ROC: 1.00
Test accuracy: 0.83
Test recall: 0.99
Test precision: 0.80
Test AUC-ROC: 0.96
Validation accuracy: 0.94
Validation recall: 1.00
Validation precision: 0.89
Validation AUC-ROC: 1.00
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for validation
Printing how this model fairs for class PNEUMONIA for test
Confusion Matrix for Test
Confusion Matrix for Validation
main_folder_path = './chest_xray'
parent_folder = './'
PCAbasedanalysis(1000, parent_folder, 400, 'two')
Loading features from ./features1000
Feature set sizes:
Training: (2000, 100352)
Test: (624, 100352)
Validation: (16, 100352)
Features loaded. Cumulative Time taken in seconds 4.171685695648193
Co-variance matrix calculation. Cumulative Time taken in seconds 4.381356716156006
SVD 54.02369427680969
Top 400 components. Cumulative Time taken in seconds 54.02369427680969
Now running SVC fit. Cumulative Time taken in seconds 54.32897210121155
Accuracy calculation. Cumulative Time taken in seconds 55.62001657485962
Training accuracy: 1.00
Training recall: 1.00
Training precision: 1.00
Training AUC-ROC: 1.00
Test accuracy: 0.84
Test recall: 0.98
Test precision: 0.81
Test AUC-ROC: 0.96
Validation accuracy: 0.88
Validation recall: 1.00
Validation precision: 0.80
Validation AUC-ROC: 1.00
Visualizing the results
ROC AUC for Test
ROC AUC for Validation
Printing how this model fairs for class PNEUMONIA for validation
Printing how this model fairs for class PNEUMONIA for test
Confusion Matrix for Test
Confusion Matrix for Validation
Non-PCA method for feature selection. As of 9/1/23, the models created with this method mostly overfit on this dataset.
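The runs in this section log "Doing feature selection using Variance & K Best", and the selected feature counts match `int(pct * n_features)` (for example, 0.90 x 100352 gives 90316). A minimal sketch of such a two-stage selection with scikit-learn follows. The zero-variance threshold, the `f_classif` score function, the synthetic data, and the helper name `select_features` are assumptions rather than the notebook's confirmed choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif


def select_features(X_train, y_train, X_other, pct: float):
    """Drop constant features, then keep the top pct share by univariate score."""
    k = int(pct * X_train.shape[1])

    # Stage 1: remove features with zero variance in the training set.
    variance_filter = VarianceThreshold(threshold=0.0).fit(X_train)
    X_train_v = variance_filter.transform(X_train)

    # Stage 2: keep the k best remaining features by ANOVA F-score.
    k = min(k, X_train_v.shape[1])
    kbest = SelectKBest(f_classif, k=k).fit(X_train_v, y_train)

    # Both selectors are fitted on training data only, then applied elsewhere.
    X_other_sel = kbest.transform(variance_filter.transform(X_other))
    return kbest.transform(X_train_v), X_other_sel


# Synthetic stand-in for the image feature matrix (assumption).
X, y = make_classification(n_samples=120, n_features=40, n_informative=8, random_state=0)
X_train_sel, X_test_sel = select_features(X[:100], y[:100], X[100:], pct=0.5)
print(X_train_sel.shape, X_test_sel.shape)
```

Fitting the selectors on the training split alone keeps test and validation labels out of the selection step, which matters when the feature count far exceeds the sample count as it does here.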
howmanyimages() ## To proce we are using originals for Test and Validation
# Using the complex feature combination here - C-HIST, LBP, HOG, CONTOUR, EDGE
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(500, main_folder_path, parent_folder, gridsrch = True, load = True,
tne = True, whichftr = 'two', optslist = 'RF',
simplifyfeatures = True, pct = 0.90)
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
Loading saved features
Train & Evaluate section with a choice of 6 optimizers & 4 features ==>
Doing feature selection using Variance & K Best
Original feature count 100352
Cumulative Time taken: 4.422534227371216
X_train selected shape: (1000, 90316)
X_test selected shape: (624, 90316)
X_val selected shape: (16, 90316)
Training with Random Forest..
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Random Forest:
Accuracy train: 1.0
Accuracy test: 0.7980769230769231
Accuracy val: 0.875
AUC ROC tests: 0.9318485645408723
AUC ROC val: 1.0
Precision test: 0.813767459528958
Recall test: 0.7980769230769231
Precision val: 0.9
Recall val: 0.875
Printing how this model fairs for class PNEUMONIA
Plot the gridsearch results for Random Forest
Cumulative Time taken in seconds was 206.93473887443542
Cumulative Time taken: 212.6422460079193
Cumulative Time taken so far (seconds): 212.6422460079193
All Done!!
howmanyimages() ## To proce we are using originals for Test and Validation
# Using the complex feature combination here - C-HIST, LBP, HOG, CONTOUR, EDGE
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(500, main_folder_path, parent_folder, gridsrch = True, load = True,
tne = True, whichftr = 'four', optslist = 'RF',
simplifyfeatures = True, pct = 0.50)
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
Loading saved features
Train & Evaluate section with a choice of 6 optimizers & 4 features ==>
Doing feature selection using Variance & K Best
Original feature count 4018294
X_train selected shape: (1000, 2009147)
X_test selected shape: (624, 2009147)
X_val selected shape: (16, 2009147)
Color histogram/lbp/hog/contour/edge features with test/val dataset with 500 samples
Training with Random Forest..
Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best parameters: {'max_depth': 30, 'max_features': 'sqrt', 'min_samples_leaf': 8, 'min_samples_split': 2, 'n_estimators': 100}
Random Forest:
Accuracy train: 0.99
Accuracy test: 0.7788461538461539
Accuracy val: 0.8125
AUC ROC tests: 0.8640258601797063
AUC ROC val: 0.96875
Precision test: 0.8081747839485497
Recall test: 0.7788461538461539
Precision val: 0.8636363636363636
Recall val: 0.8125
Printing how this model fairs for class PNEUMONIA
Plot the gridsearch results for Random Forest
Cumulative Time taken in seconds was 4330.342844486237
Cumulative Time taken so far (seconds): 4697.096110582352
All Done!!
howmanyimages() ## To proce we are using originals for Test and Validation
# Using the complex feature combination here - C-HIST, LBP, HOG, CONTOUR, EDGE
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(500, main_folder_path, parent_folder, gridsrch = True, load = True,
tne = True, whichftr = 'four', optslist = 'KNN',
simplifyfeatures = True, pct = 0.40)
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
Loading saved features
Train & Evaluate section with a choice of 6 optimizers & 4 features ==>
Doing feature selection using Variance & K Best
Original feature count 4018294
X_train selected shape: (1000, 1607317)
X_test selected shape: (624, 1607317)
X_val selected shape: (16, 1607317)
Color histogram/lbp/hog/contour/edge features with test/val dataset with 500 samples
Training with KNN..
Fitting 5 folds for each of 2 candidates, totalling 10 fits
k-Nearest Neighbors:
Best parameters: {'algorithm': 'kd_tree', 'leaf_size': 20, 'metric': 'minkowski', 'n_neighbors': 30, 'p': 2, 'weights': 'distance'}
Accuracy train: 1.0
Accuracy test: 0.6987179487179487
Accuracy val: 0.5
Precision: 0.6898467432950193
Recall: 0.6987179487179487
Precision: 0.5
Recall: 0.5
AUC ROC tests: 0.7372452333990795
AUC ROC val: 0.671875
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken was 5602.360122919083
Cumulative Time taken so far (seconds): 5954.966990232468
All Done!!
howmanyimages() ## To proce we are using originals for Test and Validation
# Using the complex feature combination here - C-HIST, LBP, HOG, CONTOUR, EDGE
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(500, main_folder_path, parent_folder, gridsrch = False, load = True,
tne = True, whichftr = 'four', optslist = 'SVM,RF',
simplifyfeatures = True, pct = 0.55)
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 397.45it/s] 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1992.54it/s] 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1998.72it/s]
Sizes of Train, Test & Validation arrays 1000 624 16
Loading saved features
Train & Evaluate section with a choice of 6 optimizers & 4 features ==>
Doing feature selection using Variance & K Best
Original feature count 4018294
X_train selected shape: (1000, 2210061)
X_test selected shape: (624, 2210061)
X_val selected shape: (16, 2210061)
Color histogram/lbp/hog/contour/edge features with test/val dataset with 500 samples
Training with SVM/SVC..
Support Vector Machines:
Accuracy train: 1.0
Accuracy test: 0.7836538461538461
Accuracy val: 0.75
AUC ROC test: 0.8254766600920446
AUC ROC val: 1.0
Precision test: 0.8033364043169254
Recall test: 0.7836538461538461
Precision val: 0.8333333333333333
Recall val: 0.75
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 3436.881238222122
Training with Random Forest..
Random Forest:
Accuracy test: 0.7740384615384616
Accuracy val: 0.8125
AUC ROC test: 0.8802432610124917
AUC ROC val: 0.9453125
Precision test: 0.7970826639404118
Recall test: 0.7740384615384616
Precision val: 0.8636363636363636
Recall val: 0.8125
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 3575.6503541469574
Cumulative Time taken so far (seconds): 3937.752229452133
All Done!!
howmanyimages() ## To proce we are using originals for Test and Validation
# Using the complex feature combination here - C-HIST, LBP, HOG, CONTOUR, EDGE
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(100, main_folder_path, parent_folder, gridsrch = False, load = True,
tne = True, whichftr = 'four', optslist = 'LR,SVM,KNN,RF',
simplifyfeatures = True, pct = 0.85, color_space_clhce = 'BGR')
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 665.45it/s] 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1993.49it/s] 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1989.71it/s]
Sizes of Train, Test & Validation arrays 200 200 16
Loading saved features
Train & Evaluate section with a choice of 6 optimizers & 4 features ==>
Doing feature selection using Variance & K Best
Original feature count 4018294
X_train selected shape: (200, 3415549)
X_test selected shape: (200, 3415549)
X_val selected shape: (16, 3415549)
Color histogram/lbp/hog/contour/edge features with test/val dataset with 100 samples
Logistic Regression training:
Accuracy test: 0.71
Accuracy val: 0.75
AUC ROC tests: 0.7781
AUC ROC val: 0.6953125
Precision test: 0.7878289473684211
Recall test: 0.71
Precision val: 0.8333333333333333
Recall val: 0.75
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 41.98141622543335
Training with SVM/SVC..
Support Vector Machines:
Accuracy test: 0.785
Accuracy val: 0.75
AUC ROC test: 0.82765
AUC ROC val: 0.890625
Precision test: 0.804
Recall test: 0.785
Precision val: 0.8333333333333333
Recall val: 0.75
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 287.0135793685913
Training with Random Forest..
Random Forest:
Accuracy test: 0.775
Accuracy val: 0.875
AUC ROC test: 0.9107
AUC ROC val: 0.9609375
Precision test: 0.8186189317576179
Recall test: 0.775
Precision val: 0.9
Recall val: 0.875
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 309.3881187438965
Cumulative Time taken so far (seconds): 360.23172068595886
All Done!!
howmanyimages() ## To proce we are using originals for Test and Validation
# Using the complex feature combination here - C-HIST, LBP, HOG, CONTOUR, EDGE
main_folder_path = './chest_xray'
parent_folder = './'
mlfeaturizationandtraining(100, main_folder_path, parent_folder, gridsrch = False, load = True,
tne = True, whichftr = 'four', optslist = 'LR,SVM,KNN',
simplifyfeatures = True, pct = 0.65, color_space_clhce = 'BGR')
For Training
Number of files per class in the main folder: {'NORMAL': 1344, 'PNEUMONIA': 3874}. We will pick a few out of these for training, test & validation
For Test
Number of files per class in the main folder: {'NORMAL': 234, 'PNEUMONIA': 390}. We will pick a few out of these for training, test & validation
For Validation
Number of files per class in the main folder: {'NORMAL': 8, 'PNEUMONIA': 8}. We will pick a few out of these for training, test & validation
100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 499.00it/s] 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1994.44it/s] 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2002.05it/s]
Sizes of Train, Test & Validation arrays 200 200 16
Loading saved features
Train & Evaluate section with a choice of 6 optimizers & 4 features ==>
Doing feature selection using Variance & K Best
Original feature count 4018294
X_train selected shape: (200, 2611891)
X_test selected shape: (200, 2611891)
X_val selected shape: (16, 2611891)
Color histogram/lbp/hog/contour/edge features with test/val dataset with 100 samples
Logistic Regression training:
Accuracy test: 0.695
Accuracy val: 0.6875
AUC ROC tests: 0.74665
AUC ROC val: 0.65625
Precision test: 0.7635491282605759
Recall test: 0.695
Precision val: 0.8076923076923077
Recall val: 0.6875
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 29.049189805984497
Training with SVM/SVC..
Support Vector Machines:
Accuracy test: 0.765
Accuracy val: 0.75
AUC ROC test: 0.8282
AUC ROC val: 0.921875
Precision test: 0.7893328966044328
Recall test: 0.765
Precision val: 0.8333333333333333
Recall val: 0.75
Printing how this model fairs for class PNEUMONIA
Cumulative Time taken in seconds was 228.7582495212555
Cumulative Time taken so far (seconds): 276.4923017024994
All Done!!
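The two runs keep 3415549 and 2611891 of the 4018294 original features. With `pct=0.65` from the call above, `int(pct * 4018294)` gives 2611891, which matches this run; 3415549 would correspond to a fraction of 0.85. The sketch below only reproduces that arithmetic. Reading `pct` as the fraction of the original feature count kept by the K-best step is an inference from the log, not something the notebook states, and `k_from_pct` is a hypothetical helper.

```python
def k_from_pct(n_features: int, pct: float) -> int:
    """Number of features kept when retaining a fraction `pct`, truncated to an integer."""
    return int(pct * n_features)


original_feature_count = 4018294  # "Original feature count" from the log

for pct in (0.85, 0.65):
    kept = k_from_pct(original_feature_count, pct)
    print(f"pct={pct}: keep {kept} of {original_feature_count} features")
```

Even at 0.65 the model sees about 2.6 million features for 200 training images, so a much smaller `pct` may be worth trying when training time matters.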
This section needs to be run only one time to generate the augmented dataset and then save it to the file system and gdrive.
import os

main_folder = './skin_cancer_dataset_nohair_more_data'
# 'mel' (melanoma) is the class name that appears in the printed counts
class_names = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
files_per_subfolder = {}
for class_name in class_names:
    subfolder_path = os.path.join(main_folder, class_name)
    if os.path.isdir(subfolder_path):
        # Count regular files only, skipping nested directories
        files_per_subfolder[class_name] = len([
            name for name in os.listdir(subfolder_path)
            if os.path.isfile(os.path.join(subfolder_path, name))
        ])
print(f'Number of files per subfolder: {files_per_subfolder}')
Number of files per subfolder: {'akiec': 327, 'bcc': 514, 'bkl': 1099, 'df': 115, 'mel': 1113, 'nv': 3000, 'vasc': 142}
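The counts above show how uneven the classes are. As a rough guide to how much augmentation each class would need to reach the size of the largest one, the ratio to the largest class can be computed directly. This is a minimal sketch using the printed counts; `augmentation_factors` is a hypothetical helper, and the notebook itself picks its flip and noise settings by hand rather than from such a ratio.

```python
# Counts copied from the printed output above
counts = {'akiec': 327, 'bcc': 514, 'bkl': 1099, 'df': 115,
          'mel': 1113, 'nv': 3000, 'vasc': 142}


def augmentation_factors(class_counts):
    """Ratio between the largest class and each class (1.0 means already the largest)."""
    target = max(class_counts.values())
    return {name: target / n for name, n in class_counts.items()}


factors = augmentation_factors(counts)

# List the classes from most to least under-represented
for name, factor in sorted(factors.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:>6}: {counts[name]:>5} images, x{factor:.1f} to match the largest class')
```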
import os
import glob
import imageio
import cv2
from scipy import ndimage
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import numpy as np

def is_duplicate(image, images):
    """Check if an image is a duplicate of another image in a list of images."""
    for im in images:
        if np.array_equal(image, im):
            return True
    return False
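`is_duplicate` compares the candidate against every stored image, so the cost grows with the number of images already kept. One alternative is to remember a digest of each image and test set membership instead. The sketch below is an assumption-laden illustration rather than the notebook's method: `content_key` and `SeenImages` are hypothetical names, and it works on raw bytes plus a shape so it runs without any image library. For a NumPy array `im`, `im.tobytes()` and `im.shape` would supply the two arguments.

```python
import hashlib


def content_key(raw: bytes, shape: tuple) -> str:
    """SHA-256 digest of the pixel bytes; the shape is mixed in so that
    equal bytes with different dimensions are not treated as duplicates."""
    digest = hashlib.sha256()
    digest.update(repr(tuple(shape)).encode("utf-8"))
    digest.update(raw)
    return digest.hexdigest()


class SeenImages:
    """Remembers digests of the images seen so far."""

    def __init__(self):
        self._keys = set()

    def add_if_new(self, raw: bytes, shape: tuple) -> bool:
        """Return True and remember the image if it has not been seen before."""
        key = content_key(raw, shape)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


seen = SeenImages()
pixels = bytes([0, 1, 2, 3])
first = seen.add_if_new(pixels, (2, 2))      # new image
second = seen.add_if_new(pixels, (2, 2))     # same bytes, same shape
reshaped = seen.add_if_new(pixels, (4, 1))   # same bytes, different shape
print(first, second, reshaped)
```

Only the digests are kept, so the images themselves do not have to stay in memory for the comparison.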
def flip_images(images, flip_code, class_name, set_name):
    """Flip the images and save them to disk in batches.

    Relies on a module-level `target_folder` being defined before the call.
    """
    unique_images = []
    batch_size = 500  # len(images)
    # cv2.flip uses -1 for "flip both axes"; the folder name uses 2 instead.
    # A separate variable keeps flip_code itself unchanged for later batches.
    folder_code = 2 if flip_code == -1 else flip_code
    out_dir = f'{target_folder}/{class_name}/{set_name}/augmented/flipped_{folder_code}'
    k = 0  # running file index across batches, so later batches do not overwrite earlier files
    try:
        os.makedirs(out_dir, exist_ok=True)
        for j in range(0, len(images), batch_size):
            batch = images[j:j+batch_size]
            flipped_batch = [cv2.flip(im, flip_code) for im in tqdm(batch)]
            # Save the augmented images to disk after every batch
            for im in flipped_batch:
                # Convert the pixel values to integers and clip them to the range [0, 255]
                im = np.clip(im, 0, 255).astype(np.uint8)
                if not is_duplicate(im, unique_images):
                    imageio.imwrite(f'{out_dir}/{k}.jpg', im)
                    unique_images.append(im)
                    k += 1
            # Discard the memory usage after each batch
            del flipped_batch
    except Exception as e:
        print(str(e))
def crop_image(image):
    """Crop an image to remove black borders."""
    # Convert the image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Create a binary mask where pixels with a value greater than 0 are set to 255
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY)
    # Find the contours in the binary mask
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Find the bounding rectangle for the largest contour
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    # Crop the image using the bounding rectangle
    cropped = image[y:y+h, x:x+w]
    return cropped

def rotate_and_crop_image(image, angle):
    """Rotate and crop an image to remove black borders."""
    # Compute the size of the border needed to contain all the image data after rotation
    h, w = image.shape[:2]
    diagonal = int(np.ceil(np.sqrt(h**2 + w**2)))
    pad_h = (diagonal - h) // 2
    pad_w = (diagonal - w) // 2
    # Pad the image with a border of a constant value
    padded = cv2.copyMakeBorder(image, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=0)
    # Rotate the padded image
    rotated = ndimage.rotate(padded, angle)
    # Crop the rotated image to remove black borders
    cropped = crop_image(rotated)
    return cropped
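The padding in `rotate_and_crop_image` comes from the image diagonal: a square whose side equals the diagonal can hold the image at any rotation angle. The sketch below repeats that arithmetic with the standard library only. `rotation_padding` is a hypothetical helper and the 450 x 600 size is just an illustrative input.

```python
import math


def rotation_padding(h: int, w: int):
    """Return (diagonal, pad_h, pad_w) using the same arithmetic as rotate_and_crop_image."""
    diagonal = math.ceil(math.sqrt(h**2 + w**2))
    pad_h = (diagonal - h) // 2
    pad_w = (diagonal - w) // 2
    return diagonal, pad_h, pad_w


h, w = 450, 600  # illustrative image size (rows, columns)
diagonal, pad_h, pad_w = rotation_padding(h, w)
padded_h, padded_w = h + 2 * pad_h, w + 2 * pad_w

print(f'diagonal={diagonal}, pad_h={pad_h}, pad_w={pad_w}')
print(f'padded size: {padded_h} x {padded_w}')
```

Because of the integer division, a padded side can come out one pixel shorter than the diagonal when `diagonal - h` or `diagonal - w` is odd.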
def rotate_images(images, class_name, set_name):
    """Rotate and crop images and save them to disk in batches.

    Relies on a module-level `target_folder` being defined before the call.
    """
    unique_images = []
    batch_size = 500  # len(images)
    out_dir = f'{target_folder}/{class_name}/{set_name}/augmented/rotated'
    os.makedirs(out_dir, exist_ok=True)
    k = 0  # running file index across batches, so later batches do not overwrite earlier files
    print("Rotating & cropping")
    for j in range(0, len(images), batch_size):
        batch = images[j:j+batch_size]
        if class_name in ['akiec', 'bcc', 'df', 'vasc']:
            # Two angles per image: 0 and 180 degrees
            rotated_batch = [rotate_and_crop_image(im, angle) for im in tqdm(batch)
                             for angle in range(0, 360, 180)]
        else:
            # range(0, 360, 360) yields only 0, so these images are cropped but not rotated
            rotated_batch = [rotate_and_crop_image(im, angle) for im in tqdm(batch)
                             for angle in range(0, 360, 360)]
        # Save the augmented images to disk after every batch
        for im in rotated_batch:
            # Convert the pixel values to integers and clip them to the range [0, 255]
            im = np.clip(im, 0, 255).astype(np.uint8)
            if not is_duplicate(im, unique_images):
                imageio.imwrite(f'{out_dir}/{k}.jpg', im)
                unique_images.append(im)
                k += 1
        # Discard the memory usage after each batch
        del rotated_batch
def add_noise_to_images(images, class_name, set_name):
    """Add Gaussian noise to the images and save them to disk in batches.

    Relies on a module-level `target_folder` being defined before the call.
    """
    unique_images = []
    batch_size = 500  # len(images)
    out_dir = f'{target_folder}/{class_name}/{set_name}/augmented/noisy'
    k = 0  # running file index across batches, so later batches do not overwrite earlier files
    for j in range(0, len(images), batch_size):
        batch = images[j:j+batch_size]
        noisy_images = []
        if class_name in ['bkl', 'mel']:
            # One noise level per image (std = 10)
            noisy_batch = [im + np.random.normal(0, std, im.shape)
                           for im in tqdm(batch) for std in range(10, 20, 10)]
        elif class_name in ['akiec', 'bcc']:
            # Nine noise levels per image (std = 10, 20, ..., 90)
            noisy_batch = [im + np.random.normal(0, std, im.shape)
                           for im in tqdm(batch) for std in range(10, 100, 10)]
        elif class_name in ['df', 'vasc']:
            # Sixteen noise levels per image (std = 10, 15, ..., 85)
            noisy_batch = [im + np.random.normal(0, std, im.shape)
                           for im in tqdm(batch) for std in range(10, 90, 5)]
            print("Adding rotation to noisy images for some classes")
            # The rotation variant is disabled; the noisy images are flipped instead
            #noisy_and_rotated_batch = [rotate_and_crop_image(im, angle) for im in tqdm(noisy_batch)
            #                           for angle in range(0, 360, 180)]
            xl = int(len(noisy_batch)/2)
            #print(len(noisy_batch)/2, type(noisy_batch))
            noisy_flipped_batch_1 = [cv2.flip(im, 1) for im in tqdm(noisy_batch[:xl])]
            # Note: the slice [xl:-1] leaves out the last noisy image
            noisy_flipped_batch_11 = [cv2.flip(im, -1) for im in tqdm(noisy_batch[xl:-1])]
        else:
            # No noise augmentation for the remaining classes (e.g. 'nv')
            return
        if class_name in ['df', 'vasc']:
            noisy_images.extend(noisy_flipped_batch_1)
            noisy_images.extend(noisy_flipped_batch_11)
        noisy_images.extend(noisy_batch)
        # Save the augmented images to disk after every batch
        os.makedirs(out_dir, exist_ok=True)
        for im in noisy_images:
            # Convert the pixel values to integers and clip them to the range [0, 255]
            im = np.clip(im, 0, 255).astype(np.uint8)
            if not is_duplicate(im, unique_images):
                imageio.imwrite(f'{out_dir}/{k}.jpg', im)
                unique_images.append(im)
                k += 1
        # Discard the memory usage after each batch
        del noisy_batch
        del noisy_images
def augmentation():
    """
    Augment the images to improve the class balance.

    The data from Kaggle is very imbalanced across classes.
    We can augment using flipping, rotating and adding Gaussian noise.
    Typically we would run this on a dataset where all hair has been removed;
    hair is visible in most images since this is a skin cancer dataset.
    """
    # The helper functions above read target_folder as a module-level name
    global target_folder
    ## ---->> Low on resources, so splitting the work into parts & going sequentially
    #class_names = ['akiec', 'bcc']
    #class_names = ['nv']
    #class_names = ['mel','bkl']
    #class_names = ['df', 'vasc']
    class_names = []  # select one of the groups above for each run
    main_folder = './skin_cancer_dataset_nohair_more_data'
    target_folder = './skin_cancer_ML_dataset_nohair_more_data'
    batch_size = 200
    X_train = []
    y_train = []
    X_test = []
    y_test = []
    X_val = []
    y_val = []
    for i, class_name in enumerate(class_names):
        print(f'Processing class {class_name}...')
        images = []
        for filename in glob.glob(f'{main_folder}/{class_name}/*.jpg'):
            im = imageio.imread(filename)
            # Keep the base name without its extension; '.jpg' is appended again when saving
            images.append((im, os.path.splitext(os.path.basename(filename))[0]))
        print('Splitting data into train, test, and validation sets...')
        # 60-20-20 split per class: 20% for test, then 25% of the remaining 80% for validation
        X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
            images, [i] * len(images), test_size=0.2, random_state=0)
        X_train_class, X_val_class, y_train_class, y_val_class = train_test_split(
            X_train_class, y_train_class, test_size=0.25, random_state=0)
        # Save the images to disk after the first stage of train-test-validation split
        os.makedirs(f'{target_folder}/{class_name}/train/original', exist_ok=True)
        os.makedirs(f'{target_folder}/{class_name}/test/original', exist_ok=True)
        os.makedirs(f'{target_folder}/{class_name}/val/original', exist_ok=True)
        for im, im_n in X_train_class:
            imageio.imwrite(f'{target_folder}/{class_name}/train/original/{im_n}.jpg', np.array(im))
        for im, im_n in X_test_class:
            imageio.imwrite(f'{target_folder}/{class_name}/test/original/{im_n}.jpg', np.array(im))
        for im, im_n in X_val_class:
            imageio.imwrite(f'{target_folder}/{class_name}/val/original/{im_n}.jpg', np.array(im))
        print('Augmenting data using image flipping and rotation with different angles '
              'and adding Gaussian noise with different levels of intensity...')
        # Load the images from the saved location to perform the image augmentations on them
        X_train_class = [imageio.imread(filename) for filename in
                         glob.glob(f'{target_folder}/{class_name}/train/original/*.jpg')]
        X_test_class = [imageio.imread(filename) for filename in
                        glob.glob(f'{target_folder}/{class_name}/test/original/*.jpg')]
        X_val_class = [imageio.imread(filename) for filename in
                       glob.glob(f'{target_folder}/{class_name}/val/original/*.jpg')]
        # Augment the data in each set using image flipping and adding Gaussian
        # noise with different levels of intensity (rotation is currently disabled).
        # The majority class 'nv' only gets the vertical flip.
        # Save the images to disk after every block of operations on X_train
        X_train.extend(X_train_class)
        flip_images(X_train_class, 0, class_name, 'train')
        if class_name != "nv":
            flip_images(X_train_class, 1, class_name, 'train')
            flip_images(X_train_class, -1, class_name, 'train')
        add_noise_to_images(X_train_class, class_name, 'train')
        #rotate_images(X_train_class, class_name, 'train')
        # Save the images to disk after every block of operations on X_test
        X_test.extend(X_test_class)
        flip_images(X_test_class, 0, class_name, 'test')
        if class_name != "nv":
            flip_images(X_test_class, 1, class_name, 'test')
            flip_images(X_test_class, -1, class_name, 'test')
        #rotate_images(X_test_class, class_name, 'test')
        add_noise_to_images(X_test_class, class_name, 'test')
        # Save the images to disk after every block of operations on X_val
        X_val.extend(X_val_class)
        flip_images(X_val_class, 0, class_name, 'val')
        if class_name != "nv":
            flip_images(X_val_class, 1, class_name, 'val')
            flip_images(X_val_class, -1, class_name, 'val')
        #rotate_images(X_val_class, class_name, 'val')
        add_noise_to_images(X_val_class, class_name, 'val')
    print("Done All!")

## call the main function
augmentation()
Processing class df...
/var/folders/dy/y7y5gds57930wyf5hns2qcjw0000gn/T/ipykernel_17141/1758442644.py:177: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  im = imageio.imread(filename)
Splitting data into train, test, and validation sets...
Augmenting data using image flipping and rotation with different angles and adding Gaussian noise with different levels of intensity...
/var/folders/dy/y7y5gds57930wyf5hns2qcjw0000gn/T/ipykernel_17141/1758442644.py:208: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  X_train_class = [imageio.imread(filename) for filename in
/var/folders/dy/y7y5gds57930wyf5hns2qcjw0000gn/T/ipykernel_17141/1758442644.py:211: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  X_test_class = [imageio.imread(filename) for filename in
/var/folders/dy/y7y5gds57930wyf5hns2qcjw0000gn/T/ipykernel_17141/1758442644.py:213: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
  X_val_class = [imageio.imread(filename) for filename in
100%|█████████████████████████████████████████| 69/69 [00:00<00:00, 4730.96it/s]
100%|█████████████████████████████████████████| 69/69 [00:00<00:00, 5599.44it/s]
100%|█████████████████████████████████████████| 69/69 [00:00<00:00, 5565.84it/s]
100%|███████████████████████████████████████████| 69/69 [00:21<00:00, 3.22it/s]
Adding rotation to noisy images for some classes
100%|████████████████████████████████████████| 552/552 [00:01<00:00, 316.88it/s] 100%|████████████████████████████████████████| 551/551 [00:02<00:00, 272.88it/s] 100%|█████████████████████████████████████████| 23/23 [00:00<00:00, 6817.11it/s] 100%|█████████████████████████████████████████| 23/23 [00:00<00:00, 3937.67it/s] 100%|█████████████████████████████████████████| 23/23 [00:00<00:00, 4535.88it/s] 100%|███████████████████████████████████████████| 23/23 [00:08<00:00, 2.76it/s]
Adding rotation to noisy images for some classes
100%|████████████████████████████████████████| 184/184 [00:00<00:00, 526.08it/s] 100%|████████████████████████████████████████| 183/183 [00:00<00:00, 286.42it/s] 100%|█████████████████████████████████████████| 23/23 [00:00<00:00, 7511.41it/s] 100%|█████████████████████████████████████████| 23/23 [00:00<00:00, 5680.66it/s] 100%|█████████████████████████████████████████| 23/23 [00:00<00:00, 5306.91it/s] 100%|███████████████████████████████████████████| 23/23 [00:07<00:00, 2.93it/s]
Adding rotation to noisy images for some classes
100%|████████████████████████████████████████| 184/184 [00:00<00:00, 497.96it/s] 100%|████████████████████████████████████████| 183/183 [00:00<00:00, 286.38it/s]
Processing class vasc... Splitting data into train, test, and validation sets... Augmenting data using image flipping and rotation with different angles and adding Gaussian noise with different levels of intensity...
100%|█████████████████████████████████████████| 84/84 [00:00<00:00, 5972.26it/s] 100%|█████████████████████████████████████████| 84/84 [00:00<00:00, 5760.75it/s] 100%|█████████████████████████████████████████| 84/84 [00:00<00:00, 5817.15it/s] 100%|███████████████████████████████████████████| 84/84 [00:31<00:00, 2.70it/s]
Adding rotation to noisy images for some classes
100%|████████████████████████████████████████| 672/672 [00:02<00:00, 302.47it/s] 100%|████████████████████████████████████████| 671/671 [00:02<00:00, 260.00it/s] 100%|█████████████████████████████████████████| 29/29 [00:00<00:00, 1297.36it/s] 100%|█████████████████████████████████████████| 29/29 [00:00<00:00, 4020.45it/s] 100%|█████████████████████████████████████████| 29/29 [00:00<00:00, 6368.31it/s] 100%|███████████████████████████████████████████| 29/29 [00:09<00:00, 3.10it/s]
Adding rotation to noisy images for some classes
100%|████████████████████████████████████████| 232/232 [00:00<00:00, 618.09it/s] 100%|████████████████████████████████████████| 231/231 [00:00<00:00, 330.78it/s] 100%|█████████████████████████████████████████| 29/29 [00:00<00:00, 1186.67it/s] 100%|█████████████████████████████████████████| 29/29 [00:00<00:00, 5935.43it/s] 100%|█████████████████████████████████████████| 29/29 [00:00<00:00, 3964.89it/s] 100%|███████████████████████████████████████████| 29/29 [00:08<00:00, 3.40it/s]
Adding rotation to noisy images for some classes
100%|████████████████████████████████████████| 232/232 [00:00<00:00, 608.27it/s] 100%|████████████████████████████████████████| 231/231 [00:00<00:00, 334.31it/s]
Done All!
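The totals in the progress bars above can be reproduced from the class sizes. The sketch below assumes that `train_test_split` rounds a fractional test size up and gives the remainder to the training side, and it takes the class sizes (115 for `df`, 142 for `vasc`) from the file counts printed earlier. `split_sizes` and `noisy_flip_counts` are hypothetical helpers written for this check only.

```python
import math


def split_sizes(n, test_size=0.2, val_size=0.25):
    """Return (train, test, val) sizes for the two-stage 60-20-20 split.

    Assumption: each fractional split rounds the held-out part up.
    """
    n_test = math.ceil(test_size * n)
    remaining = n - n_test
    n_val = math.ceil(val_size * remaining)
    return remaining - n_val, n_test, n_val


def noisy_flip_counts(n_images, n_levels=len(range(10, 90, 5))):
    """Sizes of the two flipped halves built from the noisy batch.

    Mirrors noisy_batch[:xl] and noisy_batch[xl:-1]; the second slice
    leaves out the last element, so it is one image shorter.
    """
    total = n_images * n_levels
    xl = total // 2
    return xl, total - 1 - xl


for name, n in (('df', 115), ('vasc', 142)):
    train, test, val = split_sizes(n)
    print(f'{name}: train={train}, test={test}, val={val}')
    for set_name, size in (('train', train), ('test', test), ('val', val)):
        print(f'  {set_name}: flipped-noisy halves {noisy_flip_counts(size)}')
```

The values agree with the bar totals in the log (69, 23, 23 and 552/551 for `df`; 84, 29, 29 and 672/671 for `vasc`), which also confirms that the second flipped half is one image short.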
## CURRENT CODE
import os
import shutil

# Define the base path
base_path = './skin_cancer_ML_dataset_nohair_more_data'
# Define the folder names ('mel' is the melanoma class folder)
folder_names = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
subfolder_names = ['train', 'test', 'val']
subfolder_map = {'train': 'Training', 'test': 'Test', 'val': 'Validation'}
augmented_subfolders = ['flipped_0', 'flipped_1', 'flipped_2', 'noisy']
# Initialize the counters
original_counter = 0
copy_counter = 0
# Iterate over the folder names
for folder_name in folder_names:
    for subfolder_name in subfolder_names:
        # All files of one class and split go to the same destination folder
        dst_path = os.path.join(base_path, subfolder_map[subfolder_name], folder_name)
        # Create the destination directory if it doesn't exist
        os.makedirs(dst_path, exist_ok=True)
        # Source folders: the originals first, then every augmented subfolder
        sources = [('original', os.path.join(base_path, folder_name, subfolder_name, 'original'))]
        sources += [('augmented_' + augmented_subfolder,
                     os.path.join(base_path, folder_name, subfolder_name, 'augmented', augmented_subfolder))
                    for augmented_subfolder in augmented_subfolders]
        for tag, src_path in sources:
            # Some classes do not have every augmented subfolder (e.g. 'nv' is only flipped once)
            if not os.path.isdir(src_path):
                continue
            # Count the number of files in the source directory
            original_counter += len(os.listdir(src_path))
            # Copy and rename the files from the source to the destination
            for file_name in os.listdir(src_path):
                new_file_name = subfolder_name + '_' + tag + '_' + file_name
                shutil.copy2(os.path.join(src_path, file_name), os.path.join(dst_path, new_file_name))
                copy_counter += 1
# Print the number of files copied and check if it matches the original number of files
print(f'Copied {copy_counter} files.')
if copy_counter == original_counter:
    print('All files were successfully copied.')
else:
    print('Warning: The number of files copied does not match the original number of files.')
Copied 34435 files. All files were successfully copied.
import os
import shutil

# Define the base path
base_path = './skin_cancer_ML_dataset'
# Define the folder names ('mel' is the melanoma class folder)
folder_names = ['akiec', 'bcc', 'bkl', 'df', 'mel', 'nv', 'vasc']
subfolder_names = ['Training', 'Test', 'Validation']
augmented_subfolders = ['flipped_0', 'flipped_1', 'flipped_2', 'noisy']
# Initialize the counters
original_counter = 0
copy_counter = 0
# Iterate over the subfolder names
for subfolder_name in subfolder_names:
    # Iterate over the folder names
    for folder_name in folder_names:
        # Files are copied one level up, from <split>/<class>/<variant>/ into <split>/<class>/
        dst_path = os.path.join(base_path, subfolder_name, folder_name)
        # Create the destination directory if it doesn't exist
        os.makedirs(dst_path, exist_ok=True)
        # Source folders: the originals first, then every augmented subfolder
        sources = [('original', os.path.join(base_path, subfolder_name, folder_name, 'original'))]
        sources += [('augmented_' + augmented_subfolder,
                     os.path.join(base_path, subfolder_name, folder_name, 'augmented', augmented_subfolder))
                    for augmented_subfolder in augmented_subfolders]
        for tag, src_path in sources:
            # Skip variants that were never generated for this class
            if not os.path.isdir(src_path):
                continue
            # Count the number of files in the source directory
            original_counter += len(os.listdir(src_path))
            # Copy and rename the files from the source to the destination
            for file_name in os.listdir(src_path):
                new_file_name = folder_name + '_' + tag + '_' + file_name
                shutil.copy2(os.path.join(src_path, file_name), os.path.join(dst_path, new_file_name))
                copy_counter += 1
# Print the number of files copied and check if it matches the original number of files
print(f'Copied {copy_counter} files.')
if copy_counter == original_counter:
    print('All files were successfully copied.')
else:
    print('Warning: The number of files copied does not match the original number of files.')
Copied 26752 files. All files were successfully copied.
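The success message above compares two counters that are both incremented from the same directory listing, so the check cannot fail. A stricter check counts the files that actually exist in the destination, which would also reveal name collisions between source folders. The following is a minimal sketch on a throwaway directory tree; `flatten_copy` and the folder layout are illustrative assumptions, not the notebook's code.

```python
import os
import shutil
import tempfile


def flatten_copy(sources, dst_dir, prefix):
    """Copy every file from the tagged source folders into dst_dir.

    sources is a list of (tag, folder) pairs; missing folders are skipped.
    Files are renamed to '<prefix>_<tag>_<name>' so equal names in different
    source folders do not overwrite each other. Returns the number of copies.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    for tag, src_dir in sources:
        if not os.path.isdir(src_dir):
            continue
        for name in sorted(os.listdir(src_dir)):
            src = os.path.join(src_dir, name)
            if os.path.isfile(src):
                shutil.copy2(src, os.path.join(dst_dir, f'{prefix}_{tag}_{name}'))
                copied += 1
    return copied


with tempfile.TemporaryDirectory() as root:
    # Build two small source folders that both contain 0.jpg, 1.jpg and 2.jpg
    for tag in ('original', 'flipped_0'):
        os.makedirs(os.path.join(root, 'src', tag))
        for k in range(3):
            with open(os.path.join(root, 'src', tag, f'{k}.jpg'), 'wb') as fh:
                fh.write(b'placeholder')
    # 'noisy' is listed but does not exist, so it is skipped
    sources = [(t, os.path.join(root, 'src', t)) for t in ('original', 'flipped_0', 'noisy')]
    dst = os.path.join(root, 'dst')
    copied = flatten_copy(sources, dst, 'train')
    on_disk = len(os.listdir(dst))

print(f'copied={copied}, files in destination={on_disk}')
if copied != on_disk:
    print('Warning: some copies overwrote each other.')
```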
import os

def print_folder_structure(start_path='./skin_cancer_ML_dataset'):
    """Print the directory tree below start_path, one indented line per folder."""
    for root, dirs, files in os.walk(start_path):
        level = root.replace(start_path, '').count(os.sep)
        indent = ' ' * 4 * level
        print(f'{indent}{os.path.basename(root)}/')

print_folder_structure()